thr3ads.net - llvm dev - [llvm-dev] [RFC] llvm-mca: a static performance analysis tool [Mar 2018]

If this information is useful, please help other people find it:
Share via:

Andrea Di Biagio via llvm-dev

2018-Mar-01 17:22 UTC

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

Hi all,

At Sony we developed an LLVM based performance analysis tool named
llvm-mca. We
currently use it internally to statically measure the performance of code,
and
to help triage potential problems with target scheduling models. We
decided to
post this RFC because we are interested in the feedback from the community,
and
we also believe that other people might be interested in a tool like this.

llvm-mca uses information which is already available in LLVM (e.g.
scheduling
models) to statically measure the performance of machine code in a specific
cpu.
Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with an out-of-order
backend, for which there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the
code
when run on the target, but also help with diagnosing potential performance
issues.

Given an assembly code sequence, llvm-mca estimates the IPC (instructions
per
cycle), as well as hardware resources pressure. The analysis and reporting
style
were inspired by the IACA tool from Intel.

The presence of long data dependency chains, as well as poor usage of
hardware
resources may lead to bottlenecks in the back-end. The tool is able to
generate
a detailed report which should help with identifying and analyzing sources
of
bottlenecks.

Scheduling models are mostly used to compute instruction latencies, to
obtain
read-advance information, and understand how processor resources are used by
instructions. By design, the quality of the performance analysis conducted
by
the tool is inevitably affected by the quality of the target scheduling
models.

However, scheduling models intentionally do not describe all processors
details,
since the goal is just to enable the scheduling of machine instructions
during
compilation. That means, there are processor details which are not
important for
the purpose of scheduling instructions (and therefore not described by the
scheduling model), but are very important for this tool.

A few examples of details that are missing in scheduling models are:
- Maximum number of instructions retired per cycle.
- Actual dispatch width (it often differs from the issue width).
- Number of temporary registers available for renaming.
- Number of read/write ports in the register file(s).
- Length of the load/store queue in the LSUnit.

It is also very difficult to find a "good" abstract model to describe
the
behavior of out-of-order processors. So, we have to keep in mind that all of
these aspects are going to affect the quality of the static analysis
performed
by the tool.

An extensive list of known limitations is reported in one of the last
sections
of this document. There is also a section related to design problems which
must
be addressed (hopefully with the help of the community). At the moment, the
tool has been mostly tested for x86 targets, but there are still several
limitations, some of which could be overcome by integrating extra
information
into the scheduling models.

As was mentioned before, this tool has been (and is still being) used
internally
in Sony to debug/triage issues in the btver2 scheduling model. We have also
tested it on other targets to check how generic the tool is. In our
experience,
the tool makes it easy to identify simple mistakes like "wrong number of
micro
opcodes specified for an instruction", or "wrong set of hardware
resources".
Some of these mistakes are quite common (sometimes on mature models too),
and
often difficult to catch. Reports generated by this tool are simple to
analyze,
and contain enough details to help triage most performance problems.

1. How the tool works
---------------------

The tool takes assembly code as input. Assembly code is parsed into a
sequence
of MCInst with the help of the existing LLVM target assembly parsers. The
parsed
sequence of MCInst is then analyzed by a 'Backend' module to generate a
performance report.

The Backend module internally emulates the execution of the machine code
sequence in a loop of iterations (which by default is 70). At the end of
this
process, the backend collects a number of statistics which are then printed
out
in the form of a report.

Here is an example of performance report generated by the tool for a
dot-product
of two packed float vectors of four elements. The analysis is conducted for
target x86, cpu btver2:

///////////////////

Iterations: 300
Instructions: 900
Total Cycles: 610
Dispatch Width: 2
IPC: 1.48

Resources:
[0] - JALU0
[1] - JALU1
[2] - JDiv
[3] - JFPM
[4] - JFPU0
[5] - JFPU1
[6] - JLAGU
[7] - JSAGU
[8] - JSTC
[9] - JVIMUL

Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - - - 2.00 1.00 - - - -

Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
Instructions:
- - - - - 1.00 - - - -
vmulps %xmm0, %xmm1, %xmm2
- - - - 1.00 - - - - -
vhaddps %xmm2, %xmm2, %xmm3
- - - - 1.00 - - - - -
vhaddps %xmm3, %xmm3, %xmm4

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects

[1] [2] [3] [4] [5] [6] Instructions:
1 2 1.00 vmulps %xmm0, %xmm1, %xmm2
1 3 1.00 vhaddps %xmm2, %xmm2, %xmm3
1 3 1.00 vhaddps %xmm3, %xmm3, %xmm4

///////////////////

According to this report, the dot-product kernel has been executed 300
times,
for a total of 900 instructions dynamically executed.

The report is structured in three main sections. A first section collects
a few
performance numbers; the goal of this section is to give a very quick
overview
of the performance throughput. In this example, the two important perforamce
indicators are a) the predicted total number of cycles, and b) the IPC.
IPC is probably the most important throughput indicator. A big delta
between the
Dispatch Width and the computed IPC is an indicator of potential performance
issues.

The second section is the so-called "resource pressure view". This
view
reports
the average number of resource cycles consumed every iteration by
instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource
cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every
iteration
of the dot-product, instruction 'vmulps' always executes on resource
unit
[5]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource
cycle
per iteration. Note that on Jaguar, vector FP multiply can only be issued
to
pipeline JFPU1, while horizontal FP adds can only be issued to pipeline
JFPU0.

The third (and last) section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
extra
information related to the number of micro opcodes, and opcode properties
(i.e.
'MayLoad', 'MayStore' and 'UnmodeledSideEffects').

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure
mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.

Timeline View
-------------

A detailed report of each instruction's state transitions over time can be
enabled using the command line flag '-timeline'. This prints an extra
section
in the report which contains the so-called "timeline view". Below is
the
timeline view for the dot-product example from the previous section.

///////////////
Timeline view:
012345
Index 0123456789

[0,0] DeeER. . . vmulps %xmm0, %xmm1, %xmm2
[0,1] D==eeeER . . vhaddps %xmm2, %xmm2, %xmm3
[0,2] .D====eeeER . vhaddps %xmm3, %xmm3, %xmm4

[1,0] .DeeE-----R . vmulps %xmm0, %xmm1, %xmm2
[1,1] . D=eeeE---R . vhaddps %xmm2, %xmm2, %xmm3
[1,2] . D====eeeER . vhaddps %xmm3, %xmm3, %xmm4

[2,0] . DeeE-----R . vmulps %xmm0, %xmm1, %xmm2
[2,1] . D====eeeER . vhaddps %xmm2, %xmm2, %xmm3
[2,2] . D======eeeER vhaddps %xmm3, %xmm3, %xmm4

Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

[0] [1] [2] [3]
0. 3 1.0 1.0 3.3 vmulps %xmm0, %xmm1, %xmm2
1. 3 3.3 0.7 1.0 vhaddps %xmm2, %xmm2, %xmm3
2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4
///////////////

The timeline view is very interesting because it shows how instructions
changed
in state during execution. It also gives an idea of how the tool
"sees"
instructions executed on the target.

The timeline view is structured in two tables. The first table shows how
instructions change in state over time (measured in cycles); the second
table
(named "Average Wait times") reports useful timing statistics which
should
help
diagnose performance bottlenecks caused by long data dependencies and
sub-optimal
usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices,
where
the 'first' index identifies an iteration, and the 'second'
index is the
actual
instruction index (i.e. where it appears in the code sequence).

Excluding the first and last column, the remaining columns are in cycles.
Cycles
are numbered sequentially starting from 0. The following characters are
used to
describe the state of an instruction:

D : Instruction dispatched.
e : Instruction executing.
E : Instruction executed.
R : Instruction retired.
= : Instruction already dispatched, waiting to be executed.
- : Instruction executed, waiting to be retired.

Based on the timeline view from the example, we know that:
- Instruction [1, 0] was dispatched at cycle 1.
- Instruction [1, 0] started executing at cycle 2.
- Instruction [1, 0] reached the write back stage at cycle 4.
- Instruction [1, 0] was retired at cycle 10.

Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't have to wait
in
the scheduler's queue for the operands to become available. By the time the
vmulps is dispatched, operands are already available, and pipeline JFPU1 is
ready to serve another instruction. So the instruction can be immediately
issued on the JFPU1 pipeline. That is demonstrated by the fact that the
instruction only spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire
event.
That is because instructions must retire in program order, so [1,0] has to
wait
for [0, 2] to be retired first (i.e it has to wait unti cycle 10).

In the dot-product example, all instructions are in a RAW (Read After Write)
dependency chain. Register %xmm2 written by the vmulps is immediately used
by
the first vhaddps, and register %xmm3 written by the first vhaddps is used
by
the second vhaddps. Long data dependencies negatively affect the ILP
(Instruction Level Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at register renaming stage (at the cost of allocating register
aliases,
and therefore consuming temporary registers).

Table "Average Wait times" helps diagnose performance issues that are
caused by
the presence of long latency instructions and potentially long data
dependencies
which may limit the ILP. Note that the tool by default assumes at least 1cy
between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the "ready" state is
expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue. So the difference between the two counters is a good
indicator of how big of an impact data dependencies had on the execution of
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the
number of
cycles spent in the queue tends to be bigger (i.e. more than 1-3cy)
especially
when compared with other low latency instructions.

Extra statistics to further diagnose performance issues.
--------------------------------------------------------

Flag '-verbose' enables extra statistics and performance counters for
the
dispatch logic, the reorder buffer, the retire control unit and the register
file.

Below is an example of verbose output generated by the tool for the
dot-product
example discussed in the previous sections.

///////////////////
Iterations: 300
Instructions: 900
Total Cycles: 610
Dispatch Width: 2
IPC: 1.48

Dynamic Dispatch Stall Cycles:
RAT - Register unavailable: 0
RCU - Retire tokens unavailable: 0
SCHEDQ - Scheduler full: 272
LQ - Load queue full: 0
SQ - Store queue full: 0
GROUP - Static restrictions on the dispatch group: 0

Dispatch Logic - number of cycles where we saw N instructions dispatched:
[# dispatched], [# cycles]
0, 24 (3.9%)
1, 272 (44.6%)
2, 314 (51.5%)

Schedulers - number of cycles where we saw N instructions issued:
[# issued], [# cycles]
0, 7 (1.1%)
1, 306 (50.2%)
2, 297 (48.7%)

Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
0, 109 (17.9%)
1, 102 (16.7%)
2, 399 (65.4%)

Scheduler's queue usage:
JALU01, 0/20
JFPU01, 18/18
JLSAGU, 0/12
///////////////////

Based on the verbose report, the backend was only able to dispatch two
instructions 51.5% of the time. The dispatch group was limited to one
instruction 44.6% of the cycles, which corresponds to 272 cycles.

If we look at section "Dynamic Dispatch Stall Cycles", we can see how
counter
SCHEDQ reports 272 cycles. Counter SCHEDQ is incremented every time the
dispatch
logic is unable to dispatch a full group of two instructions because the
scheduler's queue is full.

Section "Scheduler's queue usage" shows how the maximum number of
buffer
entries
(i.e. scheduler's queue entries) used at runtime for resource JFPU01
reached its
maximum. Note that AMD Jaguar implements three schedulers:
* JALU01 - A scheduler for ALU instructions
* JLSAGU - A scheduler for address generation
* JFPU01 - A scheduler floating point operations.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the
floating
point scheduler appears to be used according to section "Scheduler's
queue
usage".

A full scheduler's queue is either caused by data dependency chains, or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can
be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less
resilient
to bottlenecks caused by the presence of long data dependencies.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

LLVM-MCA instruction flow
-------------------------

This section describes the instruction flow through the out-of-order
backend, as
well as the functional units involved in the process.

An instruction goes through a default sequence of stages:
- Dispatch (Instruction is dispatched to the schedulers).
- Issue (Instruction is issued to the processor pipelines).
- Write Back (Instruction is executed, and results are written back).
- Retire (Instruction is retired; writes are architecturally committed).

The tool only models the out-of-order portion of a processor. Therefore, the
instruction fetch and decode stages are not modeled. Performance
bottlenecks in
the frontend are not diagnosed by this tool. The tool assumes that
instructions
have all been decoded and placed in a queue. Also, the tool doesn't know
anything about branch prediction.

Instruction Dispatch
--------------------

During the Dispatch stage, instructions are picked in program order from a
queue
of already decoded instructions, and dispatched in groups to the hardware
schedulers. The dispatch logic is implemented by class DispatchUnit in file
Dispatch.h.

The size of a dispatch group depends on the availability of hardware
resources,
and it cannot exceed the value of field 'DispatchWidth' in class
DispatchUnit.
Note that field DispatchWidth defaults to the value of field
'IssueWidth'
from
the scheduling model.

Users can override the DispatchWidth value with flag
"-dispatch=<N>" (where
'N'
is an unsigned quantity).

An instruction can be dispatched if:
- The size of the dispatch group is smaller than DispatchWidth
- There are enough entries in the reorder buffer
- There are enough temporary registers to do register renaming
- Schedulers are not full.

Scheduling models don't describe register files, and therefore the tool
doesn't
know if there is more than one register file, and how many temporaries are
available for register renaming.

By default, the tool (optimistically) assumes a single register file with an
unbounded number of temporary registers. Users can limit the number of
temporary registers available for register renaming using flag
`-register-file-size=<N>`, where N is the number of temporaries. A value
of
zero for N means 'unbounded'. Knowing how many temporaries are
available
for
register renaming, the tool can predict dispatch stalls caused by the lack
of
temporaries.

The number of reorder buffer entries consumed by an instruction depends on
the
number of micro-opcodes it specifies in the target scheduling model (see
field
'NumMicroOpcodes' of tablegen class ProcWriteResources and its derived
classes;
TargetSchedule.td).

The reorder buffer is implemented by class RetireControlUnit (see
Dispatch.h).
Its goal is to track the progress of instructions that are
"in-flight", and
retire instructions in program order. The number of entries in the reorder
buffer defaults to the value of field 'MicroOpBufferSize' from the
target
scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. The tool queries the scheduling model to figure out the set of
buffered resources consumed by an instruction. Buffered resources are
treated
like "scheduler" resources, and the field 'BufferSize' (from
the processor
resource tablegen definition) defines the size of the scheduler's queue.

Zero latency instructions (for example NOP instructions) don't consume
scheduler
resources. However, those instructions still reserve a number of slots in
the
reorder buffer.

Instruction Issue
-----------------

As mentioned in the previous section, each scheduler resource implements a
queue
of instructions. An instruction has to wait in the scheduler's queue until
input register operands become available. Only at that point, does the
instruction becomes eligible for execution and may be issued (potentially
out-of-order) to a pipeline for execution.

Instruction latencies can be computed by the tool with the help of the
scheduling model; latency values are defined by the scheduling model through
ProcWriteResources objects.

Class Scheduler (see file Scheduler.h) knows how to emulate multiple
processor
schedulers. A Scheduler is responsible for tracking data dependencies, and
dynamically select which processor resources are consumed/used by
instructions.

Internally, the Scheduler class delegates the management of processor
resource
units and resource groups to the ResourceManager class. ResourceManager is
also
responsible for selecting resource units that are effectively consumed by
instructions. For example, if an instruction consumes 1cy of a resource
group,
the ResourceManager object selects one of the available units from the
group; by
default, it uses a round-robin selector to guarantee that resource usage is
uniformly distributed between all units of a group.

Internally, class Scheduler implements three instruction queues:
- WaitQueue: a queue of instructions whose operands are not ready yet.
- ReadyQueue: a queue of instructions ready to execute.
- IssuedQueue: a queue of instructions executing.

Depending on the operands availability, instructions that are dispatched to
the
Scheduler are either placed into the WaitQueue or into the ReadyQueue.

Every cycle, class Scheduler checks if instructions can be moved from the
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
issued to the underlying pipelines. The algorithm prioritizes older
instructions over younger instructions.

Objects of class ResourceState (see Scheduler.h) describe processor
resources.
There is an instance of class ResourceState for each single processor
resource
specified by the scheduling model. A ResourceState object for a processor
resource with multiple units dynamically tracks the availability of every
single
unit. For example, the ResourceState of a resource group tracks the
availability of every resource in that group. Internally, ResourceState
implements a round-robin selector to dynamically pick the next unit to use
from
the group.

Write-Back and Retire Stage
---------------------------

Issued instructions are moved from the ReadyQueue to the IssuedQueue.
There,
instructions wait until they reach the write-back stage. At that point,
they
get removed from the queue and the retire control unit is notified.

On the event of "instruction executed", the retire control unit flags
the
instruction as "ready to retire".

Instruction are retired in program order; an "instruction retired"
event is
sent
to the register file which frees the temporary registers allocated for the
instruction at register renaming stage.

Load/Store Unit and Memory Consistency Model
--------------------------------------------

The tool attempts to emulate out-of-order execution of memory operations.
Class
LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues
for
speculative execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. The
number
of slots in the load/store queues is unknown by the tool, since there is no
mention of it in the scheduling model. In practice, users can specify flag
`-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in the queue
to be
equal to exactly N (an unsigned value). If N is zero, then the tool assumes
an
unbounded queue (this is the default).

LSUnit implements a relaxed consistency model for memory loads and stores.
The
rules are:
1) A younger load is allowed to pass an older load only if there is no
intervening store in between the two loads.
2) An younger store is not allowed to pass an older store.
3) A younger store is not allowed to pass an older load.
4) A younger load is allowed to pass an older store provided that the load
does
not alias with the store.

By default, this class conservatively (i.e. pessimistically) assumes that
loads
always may-alias store operations. Essentially, this LSUnit doesn't
perform any
sort of alias analysis to rule out cases where loads and stores don't
overlap
with each other. The downside of this approach however is that younger
loads are
never allowed to pass older stores. To make it possible for a younger load
to
pass an older store, users can use the command line flag -noalias. Under
'noalias', a younger load is always allowed to pass an older store.

Note that, in the case of write-combining memory, rule 2. could be relaxed
a bit
to allow reordering of non-aliasing store operations. That being said, at
the
moment, there is no way to further relax the memory model (flag -noalias is
the
only option). Essentially, there is no option to specify a different memory
type (for example: write-back, write-combining, write-through; etc.) and
consequently to weaken or strengthen the memory model.

Other limitations are:
* LSUnit doesn't know when store-to-load forwarding may occur.
* LSUnit doesn't know anything about the cache hierarchy and memory types.
* LSUnit doesn't know how to identify serializing operations and memory
fences.

No assumption is made on the store buffer size. As mentioned before, LSUnit
conservatively assumes a may-alias relation between loads and stores, and it
doesn't attempt to identify cases where store-to-load forwarding would
occur in
practice.

LSUnit doesn't attempt to predict whether a load or store hits or misses
the L1
cache. It only knows if an instruction "MayLoad" and/or
"MayStore". For
loads,
the scheduling model provides an "optimistic" load-to-use latency
(which
usually
matches the load-to-use latency for when there is a hit in the L1D).

Class MCInstrDesc in LLVM doesn't know about serializing operations, nor
memory-barrier like instructions. LSUnit conservatively assumes that an
instruction which has both 'MayLoad' and 'UnmodeledSideEffects'
behaves
like a
"soft" load-barrier. That means, it serializes loads without forcing
a
flush of
the load queue. Similarly, instructions flagged with both 'MayStore'
and
'UnmodeledSideEffects' are treated like store barriers. A full memory
barrier
is a 'MayLoad' and 'MayStore' instruction with
'UnmodeledSideEffects'.
This is
inaccurate, but it is the best that we can do at the moment with the current
information available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A
load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a
load
barrier. Also, a younger store cannot pass a store barrier. A younger
load has
to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store
queue(s). That
also means, by construction, all the older loads/stores have been executed.

In conclusion the full set of rules is:
1. A store may not pass a previous store.
2. A load may not pass a previous store unless flag 'NoAlias' is set.
3. A load may pass a previous load.
4. A store may not pass a previous load (regardless of flag
'NoAlias').
5. A load has to wait until an older load barrier is fully executed.
6. A store has to wait until an older store barrier is fully executed.

Known limitations
-----------------
Previous sections described cases where the tool is missing information to
give
an accurate report. For example, the first sections of this document
explained
how the lack of knowledge about the processor negatively affects the
performance
analysis. The lack of knowledge is often a consequence of how scheduling
models
are defined; as mentioned before, scheduling models intentionally don't
describe
processors in fine details.

The accuracy of the performance analysis is also affected by assumptions
made by
the processor model used by the tool.

Most recent Intel and AMD processors implement dedicated LoopBuffer/OpCache
in
the hardware frontend to speedup the throughput in the presence of tight
loops.
The presence of these buffers complicates the decoding logic, and requires
knowledge on the branch predictor too. Class 'SchedMachineModel' in
tablegen
provides a field named 'LoopMicroOpBufferSize' which is used to describe
loop
buffers. However, the purpose of that field is to enable loop unrolling of
tight loops; essentially, it affects the cost model used by pass
loop-unroll.

By design, the tool only cares about the out-of-order portion of a
processor,
and consequently doesn't try to predict the frontend throughput.
Processors may
implement complex decoding schemes; statically predicting the frontend
throughput is in general beyond the scope of this tool. For the same
reasons,
this tool intentionally doesn't model branch prediction. That being said,
this
tool could be definitely extended in future to also account for the hardware
frontend when doing performance analysis. This would inevitably require
extra
(extensive) processor knowledge related to all the available decoding paths
in
the hardware frontend.

When computing the IPC, the tool assumes a zero-latency "perfect"
fetch&decode
stage; the full sequence of decoded instructions is immediately visible to
the
dispatch logic from the start.

The tool doesn't know about simultaneous mutithreading. According to the
tool,
processor resources are not statically/dynamically partitioned. Processor
resources are fully available to the hardware thread executing the
microbenchmark.

The execution model implemented by this tool assumes that instructions are
firstly dispatched in groups to hardware schedulers, and then issued to
pipelines for execution. The model assumes dynamic scheduling of
instructions.
Instructions are placed in a queue and potentially executed out-of-order
(based
on the operand availability). The dispatch stage is definitely distinct
from the
issue stage.

This model doesn't correctly describe processors where the dispatch/issue
is a
single stage. This is what happens for example in VLIW processors, where
instructions are packaged and statically scheduled at compile time; it is
up to
the compiler to predict the latency of instructions and package issue groups
accordingly. For such targets, there is no dynamic scheduling done by the
hardware.

Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted
to
support processors with a single dispatch/issue stage. The execution flow
would
require some changes in the way how existing components (i.e. DispatchUnit,
Scheduler, etc.) interact. This can be a future development.

The following sections describes other known limitations. The goal is not
to
provide an extensive list of limitations; we want to report what we believe
are
the most important limitations, and suggest possible methods to overcome
them.

Load/Store barrier instructions and serializing operations
----------------------------------------------------------
Section "Load/Store Unit and Memory Consistency Model" already
mentioned how
LLVM doesn't know about serializing operations and memory barriers. Most
of it
boils down to the fact that class MCInstrDesc (intentionally) doesn't expose
those properties. Instead, both serializing operations and memory barriers
"have side-effects" according to MCInstrDesc. That is because, at
least for
scheduling purposes, knowing that an instruction has unmodeled side effects
is
often enough to treat the instruction like a compiler scheduling barrier.

A performance analysis tool could use the extra knowledge on barriers and
serializing operations to generate a more accurate performance report. One
way
to improve this is by reserving a couple of bits in field 'Flags' from
class
MCInstrDesc: one bit for barrier operations, and another bit to mark
instructions as serializing operations.

Lack of support for instruction itineraries
-------------------------------------------
The current version of the tool doesn't know how to process instruction
itineraries. This is probably one of the most important limitations, since
it
affects a few out-of-order processors in LLVM.

As mentioned in section 'Instruction Issue', class Scheduler delegates
to an
instance of class ResourceManager the handling of processor resources.
ResourceManager is where most of the scheduling logic is implemented.

Adding support for instruction itineraries requires that we teach
ResourceManager how to handle functional units and instruction stages. This
development can be a future extension, and it would probably require a few
changes to the ResourceManager interface.

Instructions that affect control flow are not correctly modeled
---------------------------------------------------------------
Examples of instructions that affect the control flow are: return, indirect
branches, calls, etc. The tool doesn't try to predict/evaluate branch
targets.
In particular, the tool doesn't model any sort of branch prediction, nor
does it
attempt to track changes to the program counter. The tool always assumes
that
the input assembly sequence is the body of a microbenchmark (a simple loop
executed for a number of iterations). The "next" instruction in
sequence is
always the next instruction to dispatch.

Call instructions default to an arbitrary high latency of 100cy. A warning
is
generated if the tool encounters a call instruction in the sequence. Return
instructions are not evaluated, and therefore control flow is not affected.
However, the tool still queries the processor scheduling model to obtain
latency
information for instructions that affect the control flow.

Possible extensions to the scheduling model
-------------------------------------------
Section "Instruction Dispatch" explained how the tool doesn't know
about the
register files, and temporaries available in each register file for register
renaming purposes.

The LLVM scheduling model could be extended to better describe register
files.
Ideally, scheduling model should be able to define:
- The size of each register file
- How many temporary registers are available for register renaming
- How register classes map to register files

The scheduling model doesn't specify the retire throughput (i.e. how many
instructions can be retired every cycle). Users can specify flag
`-max-retire-per-cycle=<uint>` to limit how many instructions the retire
control
unit can retire every cycle. Ideally, every processor should be able to
specify
the retire throughput (for example, by adding an extra field to the
scheduling
model tablegen class).

Known limitations on X86 processors
-----------------------------------

1) Partial register updates versus full register updates.

On x86-64, a 32-bit GPR write fully updates the super-register. Example:
add %edi %eax ## eax += edi

Here, register %eax aliases the lower half of 64-bit register %rax. On
x86-64,
register %rax is fully updated by the 'add' (the upper half of %rax is
zeroed).
Essentially, it "kills" any previous definition of (the upper half of)
register
%rax.

On the other hand, 8/16 bit register writes only perform a so-called
"partial
register update". Example:
add %di, %ax ## ax += di

Here, register %eax is only partially updated. To be more specific, the
lower
half of %eax is set, and the upper half is left unchanged. There is also no
change in the upper 48 bits of register %rax.

To get accurate performance analysis, the tool has to know which
instructions
perform a partial register update, and which instructions fully update the
destination's super-register.

One way to expose this information is (again) via tablegen. For example, we
could add a flag in the tablegen instruction class to tag instructions that
perform partial register updates. Something like this: 'bit
hasPartialRegisterUpdate = 1'. However, this would force a `let
hasPartialRegisterUpdate = 0` on several instruction definitions.

Another approach is to have a MCSubtargetInfo hook similar to this:
virtual bool updatesSuperRegisters(unsigned short opcode) { return
false; }

Targets will be able to override this method if needed. Again, this is
just an
idea. But the plan is to have this fixed as a future development.

2) Macro Op fusion.

The tool doesn't know about macro-op fusion. On modern x86 processors, a
'cmp/test' followed by a 'jmp' is fused into a single macro
operation. The
advantage is that the fused pair only consumes a single slot in the dispatch
group.

As a future development, the tool should be extended to address
macro-fusion.
Ideally, we could have LLVM generate a table enumerating all the opcode
pairs
that can be fused together. That table could be exposed to the tool via the
MCSubtargetInfo interface. This is just an idea; there may be better ways
to
implement this.

3) Intel processors: mixing legacy SSE with AVX instructions.

On modern Intel processors with AVX, mixing legacy SSE code with AVX code
negatively impacts the performance. The tool is not aware of this issue,
and
the performance penalty is not accounted when doing the analysis. This is
something that we would like to improve in future.

4) Zero-latency register moves and Zero-idioms.

Most modern AMD/Intel processors know how to optimize out register-register
moves and zero idioms at register renaming stage. The tool doesn't know
about these patterns, and this may negatively impact the performance
analysis.

Known design problems
---------------------
This section describes two design issues that are currently affecting the
tool.
The long term plan is to "fix" these issues.

1) Variant instructions not correctly modeled.

The tool doesn't know how to analyze instructions with a "variant"
scheduling
class descriptor. A variant scheduling class needs to be resolved
dynamically.
The "actual" scheduling class often depends on the subtarget, as well
as
properties of the specific MachineInstr object.

Unfortunately, the tool manipulates MCInst, and it doesn't know anything
about
MachineInstr. As a consequence, the tool cannot use the existing machine
subtarget hooks that are normally used to resolve the variant scheduling
class.
This is a major design issue which mostly affects ARM/AArch64 targets. It
mostly boils down to the fact that the existing scheduling framework was
meant
to work for MachineInstr.

When the tool encounters a "variant" instruction, it assumes a generic
1cy
latency. However, the tool would not be able to tell which processor
resources
are effectively consumed by the variant instruction.

2) MCInst and MCInstrDesc.

Performance analysis tools require data dependency information to correctly
predict the runtime performance of the code. This tool must always be able
to
obtain the set of implicit/explicit register defs/uses for every
instruction of
the input assembly sequence.

In the first section of this document, it was mentioned how the tool takes
as
input an assembly sequence. That sequence is parsed into a MCInst sequence
with
the help of assembly parsers available from the targets.

A MCInst is a very low-level instruction representation. The tool can
inspect
the MCOperand sequence of an MCInst to identify register operands. However,
there is no way to tell register operands that are definitions from register
operands that are uses.

In LLVM, class MCInstrDesc is used to fully describe target instructions and
their operands. The opcode of a machine instruction (a MachineInstr object)
can
be used to query the instruction set through method `MCInstrInfo::get' to
obtain
the associated MCInstrDesc object.

However class MCInstrDesc describes properties and operands of MachineInstr
objects. Essentially, MCInstrDesc is not meant to be used to describe MCInst
objects. To be more specific, MCInstrDesc objects are automatically
generated
via tablegen from the instruction set description in the target .td files.
For
example, field `MCInstrDesc::NumDefs' is always equal to the cardinality of
the
`(outs)` set from the tablegen instruction definition.

By construction, register definitions always appear at the beginning of the
MachineOperands list in MachineInstr. Basically, the (outs) are the first
operands of a MachineInstr, and the (ins) will come after in the machine
operand
list. Knowing the number of register definitions is enough to identify
all the register operands that are definitions.

In a normal compilation process, MCInst objects are generated from
MachineInstr
objects through a lowering step. By default the lowering logic simply
iterates
over the machine operands of a MachineInstr, and converts/expands them into
equivalent MCOperand objects.

The default lowering strategy has the advantage of preserving all of the
above mentioned assumptions on the machine operand sequence. That means,
register
definitions would still be at the beginning of the MCOperand sequence, and
register uses would come after.

Targets may still define custom lowering routines for specific opcodes.
Some of
these routines may lower operands in a way that potentially breaks (some
of) the
assumptions on the machine operand sequence which were valid for
MachineInstr.
Luckily, this is not the most common form of lowering done by the targets,
and
the vast majority of the MachineInstr are lowered based on the default
strategy
which preserves the original machine operand sequence. This is especially
true
for x86, where the custom lowering logic always preserves the original (i.e.
from the MachineInstr) operand sequence.

This tool currently works under the strong (and potentially incorrect)
assumption that register def/uses in a MCInst can always be identified by
querying the machine instruction descriptor for the opcode. This assumption
made
it possible to develop this tool and get good numbers at least for the
processors available in the x86 backend.

That being said, the analysis is still potentially incorrect for other
targets.
So we plan (with the help of the community) to find a proper mechanism to
map
when possible MCOperand indices back to MachineOperand indices of the
equivalent
MachineInstr. This would be equivalent to describing changes made by the
lowering step which affected the operand sequence. For example, we could
have an
index for every register MCOperand (or -1, if the operand didn't exist in
the
original MachineInstr). The mapping could look like this <0,1,3,2>. Here,
MCOperand #2 was obtained from the lowering of MachineOperand #3. etc.

This information could be automatically generated via tablegen for all the
instructions whose custom lowering step breaks assumptions made by the tool
on
the register operand sequence (In general, these instructions should be the
minority of a target's instruction set). Unfortunately, we don't have
that
information now. As a consequence, we assume that the number of explicit
register definitions is the same number specified in MCInstrDesc. We also
assume that register definitions always come first in the operand sequence.

In conclusion: these are for now the strong assumptions made by the tool:
* The number of explicit and implicit register definitions in a MCInst
matches the number of explicit and implicit definitions specified by the
MCInstrDesc object.
* Register uses always come after register definitions.
* If an opcode specifies an optional definition, then the optional
definition is always the last register operand in the sequence.

Note that some of the information accessible from the MCInstrDesc is always
valid for MCInst. For example: implicit register defs, implicit register
uses
and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties still
apply to
MCInst. The tool knows about this, and uses that information during its
analysis.

What to do next
---------------
The source code has been uploaded for review on phabricator at this link:
https://reviews.llvm.org/D43951.

The review covers two patches:
A first (very small) patch that always enables the generation of processor
resource names in the SubtargetEmitter. Currently, the processor resource
names
are only generated for debugging purposes, but are needed by the tool to
generate user friendly reports, so we would like to always generate them.
A second patch with the actual static analysis tool (in llvm/tools).

Once these first two patches are committed, the plan is to keep working on
the
tool with the help of the community to address all of the limitations
described
by the previous sections, and find good solutions/fixes for the design
issues
described by section "Known design problems".

We hope the community will find this tool useful like we have.

Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell who
really
helped me a lot by suggesting improvements and testing the tool.

Thanks for your time.
-Andrea
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180301/bc692044/attachment-0001.html>

Andrew Trick via llvm-dev

2018-Mar-02 01:16 UTC

head link

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

> On Mar 1, 2018, at 9:22 AM, Andrea Di Biagio <andrea.dibiagio at
gmail.com> wrote:
> 
> Hi all,
> 
> At Sony we developed an LLVM based performance analysis tool named
llvm-mca. We
> currently use it internally to statically measure the performance of code,
and
> to help triage potential problems with target scheduling models.  We
decided to
> post this RFC because we are interested in the feedback from the community,
and
> we also believe that other people might be interested in a tool like this.
This is a dream come true!
> llvm-mca uses information which is already available in LLVM (e.g.
scheduling
> models) to statically measure the performance of machine code in a specific
cpu.
> Performance is measured in terms of throughput as well as processor
resource
> consumption.  The tool currently works for processors with an out-of-order
> backend, for which there is a scheduling model available in LLVM.
> 
> The main goal of this tool is not just to predict the performance of the
code
> when run on the target, but also help with diagnosing potential performance
> issues.
You can’t really have one without the other, which is why this is a dream come
true.
> Given an assembly code sequence, llvm-mca estimates the IPC (instructions
per
> cycle), as well as hardware resources pressure. The analysis and reporting
style
> were inspired by the IACA tool from Intel.
> 
> The presence of long data dependency chains, as well as poor usage of
hardware
> resources may lead to bottlenecks in the back-end.  The tool is able to
generate
> a detailed report which should help with identifying and analyzing sources
of
> bottlenecks.
> 
> Scheduling models are mostly used to compute instruction latencies, to
obtain
> read-advance information, and understand how processor resources are used
by
> instructions.  By design, the quality of the performance analysis conducted
by
> the tool is inevitably affected by the quality of the target scheduling
models.
> 
> However, scheduling models intentionally do not describe all processors
details,
> since the goal is just to enable the scheduling of machine instructions
during
> compilation. That means, there are processor details which are not
important for
> the purpose of scheduling instructions (and therefore not described by the
> scheduling model), but are very important for this tool.
The LLVM machine model can have as much detail as we want, as long as it’s
opt-in for targets.
We’ve always had more detail than the generic scheduler actually needed. e.g.
the scheduler doesn’t currently care how big the per-resource buffer sizes are.
Targets have their own scheduling strategies that can pick and choose.

At one point I wrote code that could be called by the scheduler to simulate OOO
execution at much greater detail, but the benefit didn’t justify the complexity
and compile time. I always felt that a static analysis tool was the right place
for this kind of simulation.
> A few examples of details that are missing in scheduling models are:
>  - Maximum number of instructions retired per cycle.
MicroOpBufferSize is presumed to cover register renaming and retirement,
assuming they are well-balanced. For your tool, you certainly want to be more
precise.
>  - Actual dispatch width (it often differs from the issue width).
This was always a hard one to generalize in a machine independent way, and half
the world seems to swap the meaning of those terms. If you’re going to make this
distinction please define the terms clearly in the machine model and explain how
tools are expected to use the information.

Currently IssueWidth is used to tell the scheduler that ’N' microops will
definitely take ’N’ / ‘IssueWidth’ cycles regardless of which functional units
or dispatch pipeline is involved.

[Reading below, I saw that you define instruction “dispatch" as what the
LLVM machine model calls “issue”. LLVM doesn’t model dispatch directly because,
by its definition, it is redundant with the number of function units, modulo
some dynamic behavior that can’t be statically predicted anyway]

We also don’t model decoding, because, presumably you hit the micro-op limit
first.
>  - Number of temporary registers available for renaming.
See MicroOpBufferSize above.
>  - Number of read/write ports in the register file(s).
The assumption is that we hit micro-op issue width first. I suppose it’s good to
have though.
>  - Length of the load/store queue in the LSUnit.
That was supposed to be covered by per-processor-resource buffer size. Maybe you
want to simulate the load/store queue differently from other functional units?
e.g. one shared queue across multiple load store units?
> It is also very difficult to find a "good" abstract model to
describe the
> behavior of out-of-order processors. So, we have to keep in mind that all
of
> these aspects are going to affect the quality of the static analysis
performed
> by the tool.
Like the scheduler, the tool should be extensible so that different targets can
simulate behavior that doesn’t fit some abstract model.
> An extensive list of known limitations is reported in one of the last
sections
> of this document. There is also a section related to design problems which
must
> be addressed (hopefully with the help of the community).  At the moment,
the
> tool has been mostly tested for x86 targets, but there are still several
> limitations, some of which could be overcome by integrating extra
information
> into the scheduling models.
> 
> As was mentioned before, this tool has been (and is still being) used
internally
> in Sony to debug/triage issues in the btver2 scheduling model. We have also
> tested it on other targets to check how generic the tool is. In our
experience,
> the tool makes it easy to identify simple mistakes like "wrong number
of micro
> opcodes specified for an instruction", or "wrong set of hardware
resources".
> Some of these mistakes are quite common (sometimes on mature models too),
and
> often difficult to catch.  Reports generated by this tool are simple to
analyze,
> and contain enough details to help triage most performance problems.
> 
> 1. How the tool works
> ——————————
Nice documentation. Please find a good place for it to live and include it in
your patch.
> <snip>
> Timeline View
> -------------
> 
> A detailed report of each instruction's state transitions over time can
be
> enabled using the command line flag '-timeline'.  This prints an
extra section
> in the report which contains the so-called "timeline view". 
Below is the
> timeline view for the dot-product example from the previous section.
> 
> ///////////////
> Timeline view:
>                    012345
> Index    0123456789      
> 
> [0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2
> [0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3
> [0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4
> 
> [1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2
> [1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3
> [1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4
> 
> [2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2
> [2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3
> [2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4
Truly awesome.
> 
> 
> Extra statistics to further diagnose performance issues.
> --------------------------------------------------------
> 
> Flag '-verbose' enables extra statistics and performance counters
for the
> dispatch logic, the reorder buffer, the retire control unit and the
register
> file.
> 
> Below is an example of verbose output generated by the tool for the
dot-product
> example discussed in the previous sections.
> 
> ///////////////////
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
> 
> 
> Dynamic Dispatch Stall Cycles:
> RAT     - Register unavailable:                      0
> RCU     - Retire tokens unavailable:                 0
> SCHEDQ  - Scheduler full:                            272
> LQ      - Load queue full:                           0
> SQ      - Store queue full:                          0
> GROUP   - Static restrictions on the dispatch group: 0
> 
> 
> Register Alias Table:
> Total number of mappings created: 900
> Max number of mappings used:      35
> 
> 
> Dispatch Logic - number of cycles where we saw N instructions dispatched:
> [# dispatched], [# cycles]
>  0,              24  (3.9%)
>  1,              272  (44.6%)
>  2,              314  (51.5%)
> 
> 
> Schedulers - number of cycles where we saw N instructions issued:
> [# issued], [# cycles]
>  0,          7  (1.1%)
>  1,          306  (50.2%)
>  2,          297  (48.7%)
> 
> 
> Retire Control Unit - number of cycles where we saw N instructions retired:
> [# retired], [# cycles]
>  0,           109  (17.9%)
>  1,           102  (16.7%)
>  2,           399  (65.4%)
> 
> 
> Scheduler's queue usage:
> JALU01,  0/20
> JFPU01,  18/18
> JLSAGU,  0/12
> ///////////////////
So great!
> LLVM-MCA instruction flow
> -------------------------
> 
> This section describes the instruction flow through the out-of-order
backend, as
> well as the functional units involved in the process. 
> 
> An instruction goes through a default sequence of stages:
>     - Dispatch (Instruction is dispatched to the schedulers).
>     - Issue (Instruction is issued to the processor pipelines).
>     - Write Back (Instruction is executed, and results are written back).
>     - Retire (Instruction is retired; writes are architecturally
committed).
> 
> The tool only models the out-of-order portion of a processor. Therefore,
the
> instruction fetch and decode stages are not modeled. Performance
bottlenecks in
> the frontend are not diagnosed by this tool.  The tool assumes that
instructions
> have all been decoded and placed in a queue.  Also, the tool doesn't
know
> anything about branch prediction.
The following sections are fantastic documentation for your tool and the machine
model in general. I hope you check all this in somewhere.
> Instruction Dispatch
> --------------------
> 
> During the Dispatch stage, instructions are picked in program order from a
queue
> of already decoded instructions, and dispatched in groups to the hardware
> schedulers.  The dispatch logic is implemented by class DispatchUnit in
file
> Dispatch.h.
> 
> The size of a dispatch group depends on the availability of hardware
resources,
> and it cannot exceed the value of field 'DispatchWidth' in class
DispatchUnit.
> Note that field DispatchWidth defaults to the value of field
'IssueWidth' from
> the scheduling model.
> 
> Users can override the DispatchWidth value with flag
"-dispatch=<N>" (where 'N'
> is an unsigned quantity).
> 
> An instruction can be dispatched if:
>  - The size of the dispatch group is smaller than DispatchWidth
>  - There are enough entries in the reorder buffer
>  - There are enough temporary registers to do register renaming
>  - Schedulers are not full.
This is really what’s meant by LLVM’s IssueWidth. I think Intel always preferred
to call it “dispatch”. Presumably, instructions are buffered before being
dispatched/issued to functional units, so LLVM’s “IssueWidth” is meant to model
a hardware restriction that is independent from the number of functional units
(processor resources).

It’s especially confusing because some µArchs have dispatch pipelines decoupled
from the functional units.
> Scheduling models don't describe register files, and therefore the tool
doesn't
> know if there is more than one register file, and how many temporaries are
> available for register renaming.
The LLVM machine model can model the size of the register renaming pool if you
like.
> By default, the tool (optimistically) assumes a single register file with
an
> unbounded number of temporary registers.  Users can limit the number of
> temporary registers available for register renaming using flag
> `-register-file-size=<N>`, where N is the number of temporaries.  A
value of
> zero for N means 'unbounded'.  Knowing how many temporaries are
available for
> register renaming, the tool can predict dispatch stalls caused by the lack
of
> temporaries.
> 
> The number of reorder buffer entries consumed by an instruction depends on
the
> number of micro-opcodes it specifies in the target scheduling model (see
field
> 'NumMicroOpcodes' of tablegen class ProcWriteResources and its
derived classes;
> TargetSchedule.td).
> 
> The reorder buffer is implemented by class RetireControlUnit (see
Dispatch.h).
> Its goal is to track the progress of instructions that are
"in-flight", and
> retire instructions in program order.  The number of entries in the reorder
> buffer defaults to the value of field 'MicroOpBufferSize' from the
target
> scheduling model.
> 
> Instructions that are dispatched to the schedulers consume scheduler buffer
> entries.  The tool queries the scheduling model to figure out the set of
> buffered resources consumed by an instruction.  Buffered resources are
treated
> like "scheduler" resources, and the field 'BufferSize'
(from the processor
> resource tablegen definition) defines the size of the scheduler's
queue.
> 
> Zero latency instructions (for example NOP instructions) don't consume
scheduler
> resources.  However, those instructions still reserve a number of slots in
the
> reorder buffer.
As currently modeled by dummy µOps:

// A single nop micro-op (uX).
def WriteX : SchedWriteRes<[]> { let Latency = 0; }

You’ll need MachineInstr’s though.
> Instruction Issue
> -----------------
> 
> As mentioned in the previous section, each scheduler resource implements a
queue
> of instructions.  An instruction has to wait in the scheduler's queue
until
> input register operands become available.  Only at that point, does the
> instruction becomes eligible for execution and may be issued (potentially
> out-of-order) to a pipeline for execution.
> 
> Instruction latencies can be computed by the tool with the help of the
> scheduling model; latency values are defined by the scheduling model
through
> ProcWriteResources objects.
> 
> Class Scheduler (see file Scheduler.h) knows how to emulate multiple
processor
> schedulers.  A Scheduler is responsible for tracking data dependencies, and
> dynamically select which processor resources are consumed/used by
instructions.
> 
> Internally, the Scheduler class delegates the management of processor
resource
> units and resource groups to the ResourceManager class.  ResourceManager is
also
> responsible for selecting resource units that are effectively consumed by
> instructions.  For example, if an instruction consumes 1cy of a resource
group,
> the ResourceManager object selects one of the available units from the
group; by
> default, it uses a round-robin selector to guarantee that resource usage is
> uniformly distributed between all units of a group.
To be a cross-subtarget tool, this needs to be a customization point. It’s not
always so simple.
> <snip>
> Load/Store Unit and Memory Consistency Model
> --------------------------------------------
> 
> The tool attempts to emulate out-of-order execution of memory operations. 
Class
> LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues
for
> speculative execution of loads and stores.
>  
> Each load (or store) consumes an entry in the load (or store) queue.  The
number
> of slots in the load/store queues is unknown by the tool, since there is no
> mention of it in the scheduling model.  In practice, users can specify flag
> `-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in the queue
to be
> equal to exactly N (an unsigned value). If N is zero, then the tool assumes
an
> unbounded queue (this is the default).
> 
> LSUnit implements a relaxed consistency model for memory loads and stores.
The
> rules are:
> 1) A younger load is allowed to pass an older load only if there is no
>    intervening store in between the two loads.
> 2) An younger store is not allowed to pass an older store.
> 3) A younger store is not allowed to pass an older load.
> 4) A younger load is allowed to pass an older store provided that the load
does
>    not alias with the store.
> 
> By default, this class conservatively (i.e. pessimistically) assumes that
loads
> always may-alias store operations.  Essentially, this LSUnit doesn't
perform any
> sort of alias analysis to rule out cases where loads and stores don't
overlap
> with each other.  The downside of this approach however is that younger
loads are
> never allowed to pass older stores.  To make it possible for a younger load
to
> pass an older store, users can use the command line flag -noalias.  Under
> 'noalias', a younger load is always allowed to pass an older store.
I’m surprised it isn’t optimistic by default. Being pessimistic hides other
bottlenecks and seems less accurate. I guess consistency with other similar
tools (IACA) should be the aim.
> Note that, in the case of write-combining memory, rule 2. could be relaxed
a bit
> to allow reordering of non-aliasing store operations.  That being said, at
the
> moment, there is no way to further relax the memory model (flag -noalias is
the
> only option).  Essentially, there is no option to specify a different
memory
> type (for example: write-back, write-combining, write-through; etc.) and
> consequently to weaken or strengthen the memory model.
> 
> Other limitations are:
>  * LSUnit doesn't know when store-to-load forwarding may occur.
>  * LSUnit doesn't know anything about the cache hierarchy and memory
types.
>  * LSUnit doesn't know how to identify serializing operations and
memory fences.
Combining the static model with a sampled dynamic hardware event profile would
be amazing.
> No assumption is made on the store buffer size.  As mentioned before,
LSUnit
> conservatively assumes a may-alias relation between loads and stores, and
it
> doesn't attempt to identify cases where store-to-load forwarding would
occur in
> practice.
The LSUnits have a buffer size of course. The question is whether you really
need to separately model issuing the instructions to a separate memory pipeline
via a shared queue.
> LSUnit doesn't attempt to predict whether a load or store hits or
misses the L1
> cache.  It only knows if an instruction "MayLoad" and/or
"MayStore".  For loads,
> the scheduling model provides an "optimistic" load-to-use latency
(which usually
> matches the load-to-use latency for when there is a hit in the L1D).
You’re optimistic here, which is good, but pessimistic with aliasing.
> Class MCInstrDesc in LLVM doesn't know about serializing operations,
nor
> memory-barrier like instructions.  LSUnit conservatively assumes that an
> instruction which has both 'MayLoad' and
'UnmodeledSideEffects' behaves like a
> "soft" load-barrier.  That means, it serializes loads without
forcing a flush of
> the load queue.  Similarly, instructions flagged with both
'MayStore' and
> 'UnmodeledSideEffects' are treated like store barriers.  A full
memory barrier
> is a 'MayLoad' and 'MayStore' instruction with
'UnmodeledSideEffects'.  This is
> inaccurate, but it is the best that we can do at the moment with the
current
> information available in LLVM.
LLVM *should* have this information. It needs some design work though.
> A load/store barrier consumes one entry of the load/store queue.  A
load/store
> barrier enforces ordering of loads/stores.  A younger load cannot pass a
load
> barrier.  Also, a younger store cannot pass a store barrier.  A younger
load has
> to wait for the memory/load barrier to execute.  A load/store barrier is
> "executed" when it becomes the oldest entry in the load/store
queue(s). That
> also means, by construction, all the older loads/stores have been executed.
> 
> In conclusion the full set of rules is:
>   1. A store may not pass a previous store.
>   2. A load may not pass a previous store unless flag 'NoAlias' is
set.
>   3. A load may pass a previous load.
>   4. A store may not pass a previous load (regardless of flag
'NoAlias').
>   5. A load has to wait until an older load barrier is fully executed.
>   6. A store has to wait until an older store barrier is fully executed.
> 
> Known limitations
> -----------------
> Previous sections described cases where the tool is missing information to
give
> an accurate report.  For example, the first sections of this document
explained
> how the lack of knowledge about the processor negatively affects the
performance
> analysis.  The lack of knowledge is often a consequence of how scheduling
models
> are defined; as mentioned before, scheduling models intentionally don't
describe
> processors in fine details.
LLVM’s machine model should be optionally extended to model whatever a static
analysis tool needs. There’s some dynamic behavior that can’t be modeled
statically—predictive structures, the state of the pipeline entering a  loop—but
machine model precision shouldn’t be the limiting factor for your tool.
> The accuracy of the performance analysis is also affected by assumptions
made by
> the processor model used by the tool.
> 
> Most recent Intel and AMD processors implement dedicated LoopBuffer/OpCache
in
> the hardware frontend to speedup the throughput in the presence of tight
loops.
> The presence of these buffers complicates the decoding logic, and requires
> knowledge on the branch predictor too.  Class 'SchedMachineModel'
in tablegen
> provides a field named 'LoopMicroOpBufferSize' which is used to
describe loop
> buffers.  However, the purpose of that field is to enable loop unrolling of
> tight loops; essentially, it affects the cost model used by pass
loop-unroll.
> 
> By design, the tool only cares about the out-of-order portion of a
processor,
> and consequently doesn't try to predict the frontend throughput. 
Processors may
> implement complex decoding schemes; statically predicting the frontend
> throughput is in general beyond the scope of this tool.  For the same
reasons,
> this tool intentionally doesn't model branch prediction.  That being
said, this
> tool could be definitely extended in future to also account for the
hardware
> frontend when doing performance analysis.  This would inevitably require
extra
> (extensive) processor knowledge related to all the available decoding paths
in
> the hardware frontend.
If loops are ever definitely limited by decoder or fetch throughput, it would be
good to know that. You don’t need to model the predictive aspect of it, or “all
the paths”.

You might as well say you don’t need to model issue/dispatch width, or
retirement logic, because it’s decoupled from OOO execution via instruction
buffers. The point of this tool is to tell you that there is definitely a
bottleneck in a particular area of the pipeline.

Would it be more appropriate to say that a simple abstract model of decoding is
too inaccurate for your particular subtarget so there was not enough benefit to
implementing it?
> When computing the IPC, the tool assumes a zero-latency "perfect"
fetch&decode
> stage; the full sequence of decoded instructions is immediately visible to
the
> dispatch logic from the start.
> 
> The tool doesn't know about simultaneous mutithreading.  According to
the tool,
> processor resources are not statically/dynamically partitioned.  Processor
> resources are fully available to the hardware thread executing the
> microbenchmark.
> 
> The execution model implemented by this tool assumes that instructions are
> firstly dispatched in groups to hardware schedulers, and then issued to
> pipelines for execution.  The model assumes dynamic scheduling of
instructions.
> Instructions are placed in a queue and potentially executed out-of-order
(based
> on the operand availability). The dispatch stage is definitely distinct
from the
> issue stage.
I wonder why in-order analysis doesn’t mostly fall out as a degenerate case in
your tool. There’s some grey area where in-order processors hardware interlocks
that cause stalls that aren’t explicit in the scheduling groups. Those
processors would still benefit from your tool. Even if that target doesn’t
benefit from the OOO simulation, it would be nice to compare the output of your
tool with the observed performance to find bugs in the model.
> This model doesn't correctly describe processors where the
dispatch/issue is a
> single stage. This is what happens for example in VLIW processors, where
> instructions are packaged and statically scheduled at compile time; it is
up to
> the compiler to predict the latency of instructions and package issue
groups
> accordingly. For such targets, there is no dynamic scheduling done by the
> hardware.
> 
> Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted
to
> support processors with a single dispatch/issue stage. The execution flow
would
> require some changes in the way how existing components (i.e. 
DispatchUnit,
> Scheduler, etc.) interact. This can be a future development.
Ah…. Extended with future development sounds better.
> The following sections describes other known limitations.  The goal is not
to
> provide an extensive list of limitations; we want to report what we believe
are
> the most important limitations, and suggest possible methods to overcome
them.
> 
> Load/Store barrier instructions and serializing operations
> ----------------------------------------------------------
> Section "Load/Store Unit and Memory Consistency Model" already
mentioned how
> LLVM doesn't know about serializing operations and memory barriers. 
Most of it
> boils down to the fact that class MCInstrDesc (intentionally) doesn't
expose
> those properties.  Instead, both serializing operations and memory barriers
> "have side-effects" according to MCInstrDesc.  That is because,
at least for
> scheduling purposes, knowing that an instruction has unmodeled side effects
is
> often enough to treat the instruction like a compiler scheduling barrier.
> 
> A performance analysis tool could use the extra knowledge on barriers and
> serializing operations to generate a more accurate performance report.  One
way
> to improve this is by reserving a couple of bits in field 'Flags'
from class
> MCInstrDesc: one bit for barrier operations, and another bit to mark
> instructions as serializing operations.
> 
> Lack of support for instruction itineraries
> -------------------------------------------
> The current version of the tool doesn't know how to process instruction
> itineraries.  This is probably one of the most important limitations, since
it
> affects a few out-of-order processors in LLVM.
I don’t think OOO LLVM targets should be using itineraries. If those targets are
still actively maintained, they could migrate.
> As mentioned in section 'Instruction Issue', class Scheduler
delegates to an
> instance of class ResourceManager the handling of processor resources.
> ResourceManager is where most of the scheduling logic is implemented.
> 
> Adding support for instruction itineraries requires that we teach
> ResourceManager how to handle functional units and instruction stages. 
This
> development can be a future extension, and it would probably require a few
> changes to the ResourceManager interface.
> 
> Instructions that affect control flow are not correctly modeled
> ---------------------------------------------------------------
> Examples of instructions that affect the control flow are: return, indirect
> branches, calls, etc.  The tool doesn't try to predict/evaluate branch
targets.
> In particular, the tool doesn't model any sort of branch prediction,
nor does it
> attempt to track changes to the program counter.  The tool always assumes
that
> the input assembly sequence is the body of a microbenchmark (a simple loop
> executed for a number of iterations). The "next" instruction in
sequence is
> always the next instruction to dispatch.
> 
> Call instructions default to an arbitrary high latency of 100cy. A warning
is
> generated if the tool encounters a call instruction in the sequence. 
Return
> instructions are not evaluated, and therefore control flow is not affected.
> However, the tool still queries the processor scheduling model to obtain
latency
> information for instructions that affect the control flow.
By decompiling to MachineInst’s it would be easy to build a mostly complete CFG
and call graph.
> Possible extensions to the scheduling model
> -------------------------------------------
> Section "Instruction Dispatch" explained how the tool doesn't
know about the
> register files, and temporaries available in each register file for
register
> renaming purposes.
> 
> The LLVM scheduling model could be extended to better describe register
files.
> Ideally, scheduling model should be able to define:
>  - The size of each register file
>  - How many temporary registers are available for register renaming
>  - How register classes map to register files
> 
> The scheduling model doesn't specify the retire throughput (i.e. how
many
> instructions can be retired every cycle).  Users can specify flag
> `-max-retire-per-cycle=<uint>` to limit how many instructions the
retire control
> unit can retire every cycle.  Ideally, every processor should be able to
specify
> the retire throughput (for example, by adding an extra field to the
scheduling
> model tablegen class).
> 
> Known limitations on X86 processors
> -----------------------------------
> 
> 1) Partial register updates versus full register updates.
> <snip>
MachineOperand handles this. You just need to create the machine instrs.
> 2) Macro Op fusion.
> 
> The tool doesn't know about macro-op fusion. On modern x86 processors,
a
> 'cmp/test' followed by a 'jmp' is fused into a single macro
operation.  The
> advantage is that the fused pair only consumes a single slot in the
dispatch
> group. 
> 
> As a future development, the tool should be extended to address
macro-fusion.
> Ideally, we could have LLVM generate a table enumerating all the opcode
pairs
> that can be fused together. That table could be exposed to the tool via the
> MCSubtargetInfo interface.  This is just an idea; there may be better ways
to
> implement this.
This is already expressed by subtarget code. That code just needs to be exposed
as a common subtarget interface. Typically the interfaces take MachineInst, but
in this case the opcode is probably sufficient.
> 4) Zero-latency register moves and Zero-idioms.
> 
> Most modern AMD/Intel processors know how to optimize out register-register
> moves and zero idioms at register renaming stage. The tool doesn't know
> about these patterns, and this may negatively impact the performance
analysis.
The machine model has this, but requires proper MachineInstrs.
> Known design problems
> ---------------------
> This section describes two design issues that are currently affecting the
tool.
> The long term plan is to "fix" these issues.
> 
> 1) Variant instructions not correctly modeled.
> 
> The tool doesn't know how to analyze instructions with a
"variant" scheduling
> class descriptor. A variant scheduling class needs to be resolved
dynamically.
> The "actual" scheduling class often depends on the subtarget, as
well as
> properties of the specific MachineInstr object.
> 
> Unfortunately, the tool manipulates MCInst, and it doesn't know
anything about
> MachineInstr. As a consequence, the tool cannot use the existing machine
> subtarget hooks that are normally used to resolve the variant scheduling
class.
> This is a major design issue which mostly affects ARM/AArch64 targets.  It
> mostly boils down to the fact that the existing scheduling framework was
meant
> to work for MachineInstr.
There are good reasons for the scheduler to work with MachineInstrs, and any
static analysis tool should work with MachineInstrs for the same reasons...
> 2) MCInst and MCInstrDesc.
> 
> Performance analysis tools require data dependency information to correctly
> predict the runtime performance of the code. This tool must always be able
to
> obtain the set of implicit/explicit register defs/uses for every
instruction of
> the input assembly sequence.
> 
> In the first section of this document, it was mentioned how the tool takes
as
> input an assembly sequence. That sequence is parsed into a MCInst sequence
with
> the help of assembly parsers available from the targets.
> 
> A MCInst is a very low-level instruction representation. The tool can
inspect
> the MCOperand sequence of an MCInst to identify register operands. However,
> there is no way to tell register operands that are definitions from
register
> operands that are uses.
> 
> In LLVM, class MCInstrDesc is used to fully describe target instructions
and
> their operands. The opcode of a machine instruction (a MachineInstr object)
can
> be used to query the instruction set through method `MCInstrInfo::get'
to obtain
> the associated MCInstrDesc object.
> 
> However class MCInstrDesc describes properties and operands of MachineInstr
> objects. Essentially, MCInstrDesc is not meant to be used to describe
MCInst
> objects.  To be more specific, MCInstrDesc objects are automatically
generated
> via tablegen from the instruction set description in the target .td files. 
For
> example, field `MCInstrDesc::NumDefs' is always equal to the
cardinality of the
> `(outs)` set from the tablegen instruction definition.
> 
> By construction, register definitions always appear at the beginning of the
> MachineOperands list in MachineInstr. Basically, the (outs) are the first
> operands of a MachineInstr, and the (ins) will come after in the machine
operand
> list. Knowing the number of register definitions is enough to identify
> all the register operands that are definitions.
> 
> In a normal compilation process, MCInst objects are generated from
MachineInstr
> objects through a lowering step. By default the lowering logic simply
iterates
> over the machine operands of a MachineInstr, and converts/expands them into
> equivalent MCOperand objects.
> 
> The default lowering strategy has the advantage of preserving all of the
> above mentioned assumptions on the machine operand sequence. That means,
register
> definitions would still be at the beginning of the MCOperand sequence, and
> register uses would come after.
> 
> Targets may still define custom lowering routines for specific opcodes.
Some of
> these routines may lower operands in a way that potentially breaks (some
of) the
> assumptions on the machine operand sequence which were valid for
MachineInstr.
> Luckily, this is not the most common form of lowering done by the targets,
and
> the vast majority of the MachineInstr are lowered based on the default
strategy
> which preserves the original machine operand sequence.  This is especially
true
> for x86, where the custom lowering logic always preserves the original
(i.e.
> from the MachineInstr) operand sequence.
> 
> This tool currently works under the strong (and potentially incorrect)
> assumption that register def/uses in a MCInst can always be identified by
> querying the machine instruction descriptor for the opcode. This assumption
made
> it possible to develop this tool and get good numbers at least for the
> processors available in the x86 backend.
> 
> That being said, the analysis is still potentially incorrect for other
targets.
> So we plan (with the help of the community) to find a proper mechanism to
map
> when possible MCOperand indices back to MachineOperand indices of the
equivalent
> MachineInstr.  This would be equivalent to describing changes made by the
> lowering step which affected the operand sequence. For example, we could
have an
> index for every register MCOperand (or -1, if the operand didn't exist
in the
> original MachineInstr). The mapping could look like this <0,1,3,2>. 
Here,
> MCOperand #2 was obtained from the lowering of MachineOperand #3. etc.
> 
> This information could be automatically generated via tablegen for all the
> instructions whose custom lowering step breaks assumptions made by the tool
on
> the register operand sequence (In general, these instructions should be the
> minority of a target's instruction set). Unfortunately, we don't
have that
> information now.  As a consequence, we assume that the number of explicit
> register definitions is the same number specified in MCInstrDesc.  We also
> assume that register definitions always come first in the operand sequence.
> 
> In conclusion: these are for now the strong assumptions made by the tool:
>   * The number of explicit and implicit register definitions in a MCInst
>     matches the number of explicit and implicit definitions specified by
the
>     MCInstrDesc object.
>   * Register uses always come after register definitions.
>   * If an opcode specifies an optional definition, then the optional
>     definition is always the last register operand in the sequence.
> 
> Note that some of the information accessible from the MCInstrDesc is always
> valid for MCInst. For example: implicit register defs, implicit register
uses
> and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties
still apply to
> MCInst. The tool knows about this, and uses that information during its
> analysis.
You just made a very strong argument for building the MachineInstrs before
running mca. So I wonder why you didn’t do that.
> What to do next
> ---------------
> The source code has been uploaded for review on phabricator at this link:
https://reviews.llvm.org/D43951 <https://reviews.llvm.org/D43951>.
> 
> The review covers two patches:
> A first (very small) patch that always enables the generation of processor
> resource names in the SubtargetEmitter. Currently, the processor resource
names
> are only generated for debugging purposes, but are needed by the tool to
> generate user friendly reports, so we would like to always generate them.
> A second patch with the actual static analysis tool (in llvm/tools).
> 
> Once these first two patches are committed, the plan is to keep working on
the
> tool with the help of the community to address all of the limitations
described
> by the previous sections, and find good solutions/fixes for the design
issues
> described by section "Known design problems".
> 
> We hope the community will find this tool useful like we have.
> 
> Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell who
really
> helped me a lot by suggesting improvements and testing the tool.
> 
> Thanks for your time.
> -Andrea
There are a number of people on llvm-dev who can explain better than I how to
decompile into MachineInstrs. I’m not totally opposed to checking in something
that works with MCInstr, but this does run strongly contrary to the design of
LLVM’s subtarget support.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180301/4bd6c099/attachment-0001.html>

Clement Courbet via llvm-dev

2018-Mar-02 12:33 UTC

head link

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

Hi Andrea,

Thanks for this great RFC ! I've put some high-level comments here, and
I'll give more focused comments in the review on Phabricator.


On Thu, Mar 1, 2018 at 6:22 PM, Andrea Di Biagio <andrea.dibiagio at
gmail.com>
wrote:
> Hi all,
>
> At Sony we developed an LLVM based performance analysis tool named
> llvm-mca. We
> currently use it internally to statically measure the performance of code,
> and
> to help triage potential problems with target scheduling models.  We
> decided to
> post this RFC because we are interested in the feedback from the
> community, and
> we also believe that other people might be interested in a tool like this.
>
>We are definitely interested in tools like this. As a matter of fact, we
(Google) have developed a similar tool. Our approach is very similar to the
one described here, so I’m confident that we can merge our tool with the
one proposed here. There are some similarities and differences in the
design, which I’ll quickly go over here. I’ll give my opinions about
respective merits of each approach when discussing specific points below.

The main differences with the design are:

   -

   We simulate the CPU frontend too (parse/decode). We’ve found out that
   some kernels benefit from instruction reorderings that do not change port
   pressure but change how instructions are parsed and decoded (e.g. parse
   window can be a bottleneck).
   -

   Our code is more modular on the subtarget model side. Our simulator is
   divided in four parts:
   -

      CPU components (parser/decoder/renamer/ROB/Issue Port/...) are
      modeled and unit tested individually. These are intended to be mostly
      target-independant.
      -

      Each subtarget describes how to assemble these components together.
      This is where knowledge about the target is injected.
      -

      The framework drives the components in a subtarget-agnostic way.
      Events are logged to a `SimulationLog` object. This is similar to the
      `HWEventListener` interface in llvm-mca.
      -

      Analysis passes analyze the `SimulationLog` to extract whatever
      metric they care about (e.g. port pressure or IPC), or generate an
      annotated trace. A similar functionality is provided in llvm-mca `XXView`
      implementations.

We also have a IACA-like binary that displays analysis results.

For reference, our code can be found here: https://github.com/google/
EXEgesis/tree/master/llvm_sim


> llvm-mca uses information which is already available in LLVM (e.g.
> scheduling
> models) to statically measure the performance of machine code in a
> specific cpu.
> Performance is measured in terms of throughput as well as processor
> resource
> consumption.  The tool currently works for processors with an out-of-order
> backend, for which there is a scheduling model available in LLVM.
>
> The main goal of this tool is not just to predict the performance of the
> code
> when run on the target, but also help with diagnosing potential performance
> issues.
>


*Another interesting application is for (automated) optimization. We’ve
experimented with using predicted throughput to schedule instructions. This
has resulted in significant wins (see e.g. this PR improving a gemm kernel
<https://github.com/google/gemmlowp/pull/91> and this PR improving WebM
<https://github.com/webmproject/libwebp/commit/67748b41dbb21a43e88f2b6ddf6117f4338873a3>).
We think it could also be used to improve instruction selection to offload
some ports (typically un-vectorize some computations on Intel CPUs, where
port5 typically becomes saturated for vector-intensive computations).To
enable this, it is important that: - The design allows for using the tools
as a library, which is the case of both implementations.- Prediction is
reasonably fast since it’s going to be used in a loop. I think that the
listener-based design of llvm-mca is better than our tool for this purpose
because you can only register to listen what you care about and you don’t
need to go through a generic simulation log.*
>
> Given an assembly code sequence, llvm-mca estimates the IPC (instructions
> per
> cycle), as well as hardware resources pressure. The analysis and reporting
> style
> were inspired by the IACA tool from Intel.
>
> The presence of long data dependency chains, as well as poor usage of
> hardware
> resources may lead to bottlenecks in the back-end.  The tool is able to
> generate
> a detailed report which should help with identifying and analyzing sources
> of
> bottlenecks.
>
> Scheduling models are mostly used to compute instruction latencies, to
> obtain
> read-advance information, and understand how processor resources are used
> by
> instructions.  By design, the quality of the performance analysis
> conducted by
> the tool is inevitably affected by the quality of the target scheduling
> models.
>
> However, scheduling models intentionally do not describe all processors
> details,
> since the goal is just to enable the scheduling of machine instructions
> during
> compilation. That means, there are processor details which are not
> important for
> the purpose of scheduling instructions (and therefore not described by the
> scheduling model), but are very important for this tool.
>
> A few examples of details that are missing in scheduling models are:
>  - Maximum number of instructions retired per cycle.
>  - Actual dispatch width (it often differs from the issue width).
>  - Number of temporary registers available for renaming.
>  - Number of read/write ports in the register file(s).
>  - Length of the load/store queue in the LSUnit.
>
*And a few more when simulating the frontend. Note however, that the
frontend actually requires a lot less information because it’s mostly
agnostic to individual instructions. We basically only need the lengths of
a few queues. *
>
> It is also very difficult to find a "good" abstract model to
describe the
> behavior of out-of-order processors. So, we have to keep in mind that all
> of
> these aspects are going to affect the quality of the static analysis
> performed
> by the tool.
>
>*That’s the part where I think that llvm-mca could benefit from more
modularization of the model. The first step would be to split the simulator
driver (the `NotifyXXX` code in `Backend.cpp`) from the "real" backend
logic (essentially the IB, HWS, DU, SM member and the hard-coded code path
that links them), and make the backend customizable by the subtarget. I’ll
detail this in the review.*

> An extensive list of known limitations is reported in one of the last
> sections
> of this document. There is also a section related to design problems which
> must
> be addressed (hopefully with the help of the community).  At the moment,
> the
> tool has been mostly tested for x86 targets, but there are still several
> limitations, some of which could be overcome by integrating extra
> information
> into the scheduling models.
>
> As was mentioned before, this tool has been (and is still being) used
> internally
> in Sony to debug/triage issues in the btver2 scheduling model. We have also
> tested it on other targets to check how generic the tool is. In our
> experience,
> the tool makes it easy to identify simple mistakes like "wrong number
of
> micro
> opcodes specified for an instruction", or "wrong set of hardware
> resources".
> Some of these mistakes are quite common (sometimes on mature models too),
> and
> often difficult to catch.  Reports generated by this tool are simple to
> analyze,
> and contain enough details to help triage most performance problems.
>
> 1. How the tool works
> ---------------------
>
> The tool takes assembly code as input. Assembly code is parsed into a
> sequence
> of MCInst with the help of the existing LLVM target assembly parsers. The
> parsed
> sequence of MCInst is then analyzed by a 'Backend' module to
generate a
> performance report.
>
> The Backend module internally emulates the execution of the machine code
> sequence in a loop of iterations (which by default is 70). At the end of
> this
> process, the backend collects a number of statistics which are then
> printed out
> in the form of a report.
>
> Here is an example of performance report generated by the tool for a
> dot-product
> of two packed float vectors of four elements. The analysis is conducted for
> target x86, cpu btver2:
>
> ///////////////////
>
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Resources:
> [0] - JALU0
> [1] - JALU1
> [2] - JDiv
> [3] - JFPM
> [4] - JFPU0
> [5] - JFPU1
> [6] - JLAGU
> [7] - JSAGU
> [8] - JSTC
> [9] - JVIMUL
>
>
> Resource pressure per iteration:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
>  -      -      -      -     2.00   1.00    -      -      -      -
>
> Resource pressure by instruction:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
> Instructions:
>  -      -      -      -      -     1.00    -      -      -      -
> vmulps    %xmm0, %xmm1, %xmm2
>  -      -      -      -     1.00    -      -      -      -      -
> vhaddps    %xmm2, %xmm2, %xmm3
>  -      -      -      -     1.00    -      -      -      -      -
> vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Instruction Info:
> [1]: #uOps
> [2]: Latency
> [3]: RThroughput
> [4]: MayLoad
> [5]: MayStore
> [6]: HasSideEffects
>
> [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
>  1      2     1.00                        vmulps    %xmm0, %xmm1, %xmm2
>  1      3     1.00                        vhaddps    %xmm2, %xmm2, %xmm3
>  1      3     1.00                        vhaddps    %xmm3, %xmm3, %xmm4
>
> ///////////////////
>
> According to this report, the dot-product kernel has been executed 300
> times,
> for a total of 900 instructions dynamically executed.
>
> The report is structured in three main sections.  A first section collects
> a few
> performance numbers; the goal of this section is to give a very quick
> overview
> of the performance throughput. In this example, the two important
> perforamce
> indicators are a) the predicted total number of cycles, and b) the IPC.
> IPC is probably the most important throughput indicator. A big delta
> between the
> Dispatch Width and the computed IPC is an indicator of potential
> performance
> issues.
>
> The second section is the so-called "resource pressure view". 
This view
> reports
> the average number of resource cycles consumed every iteration by
> instructions
> for every processor resource unit available on the target.  Information is
> structured in two tables. The first table reports the number of resource
> cycles
> spent on average every iteration. The second table correlates the resource
> cycles to the machine instruction in the sequence. For example, every
> iteration
> of the dot-product, instruction 'vmulps' always executes on
resource unit
> [5]
> (JFPU1 - floating point pipeline #1), consuming an average of 1 resource
> cycle
> per iteration.  Note that on Jaguar, vector FP multiply can only be issued
> to
> pipeline JFPU1, while horizontal FP adds can only be issued to pipeline
> JFPU0.
>
> The third (and last) section of the report shows the latency and reciprocal
> throughput of every instruction in the sequence. That section also reports
> extra
> information related to the number of micro opcodes, and opcode properties
> (i.e.
> 'MayLoad', 'MayStore' and 'UnmodeledSideEffects').
>
> The resource pressure view helps with identifying bottlenecks caused by
> high
> usage of specific hardware resources.  Situations with resource pressure
> mainly
> concentrated on a few resources should, in general, be avoided.  Ideally,
> pressure should be uniformly distributed between multiple resources.
>
> Timeline View
> -------------
>
> A detailed report of each instruction's state transitions over time can
be
> enabled using the command line flag '-timeline'.  This prints an
extra
> section
> in the report which contains the so-called "timeline view". 
Below is the
> timeline view for the dot-product example from the previous section.
>
> ///////////////
> Timeline view:
>                    012345
> Index    0123456789
>
> [0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2
> [0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3
> [0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2
> [1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3
> [1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2
> [2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3
> [2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Average Wait times (based on the timeline view):
> [0]: Executions
> [1]: Average time spent waiting in a scheduler's queue
> [2]: Average time spent waiting in a scheduler's queue while ready
> [3]: Average time elapsed from WB until retire stage
>
>       [0]    [1]    [2]    [3]
> 0.     3     1.0    1.0    3.3      vmulps    %xmm0, %xmm1, %xmm2
> 1.     3     3.3    0.7    1.0      vhaddps    %xmm2, %xmm2, %xmm3
> 2.     3     5.7    0.0    0.0      vhaddps    %xmm3, %xmm3, %xmm4
> ///////////////
>
> The timeline view is very interesting because it shows how instructions
> changed
> in state during execution.  It also gives an idea of how the tool
"sees"
> instructions executed on the target.
>
> The timeline view is structured in two tables.  The first table shows how
> instructions change in state over time (measured in cycles); the second
> table
> (named "Average Wait times") reports useful timing statistics
which should
> help
> diagnose performance bottlenecks caused by long data dependencies and
> sub-optimal
> usage of hardware resources.
>
> An instruction in the timeline view is identified by a pair of indices,
> where
> the 'first' index identifies an iteration, and the 'second'
index is the
> actual
> instruction index (i.e. where it appears in the code sequence).
>
> Excluding the first and last column, the remaining columns are in cycles.
> Cycles
> are numbered sequentially starting from 0.  The following characters are
> used to
> describe the state of an instruction:
>
>  D : Instruction dispatched.
>  e : Instruction executing.
>  E : Instruction executed.
>  R : Instruction retired.
>  = : Instruction already dispatched, waiting to be executed.
>  - : Instruction executed, waiting to be retired.
>
> Based on the timeline view from the example, we know that:
>   - Instruction [1, 0] was dispatched at cycle 1.
>   - Instruction [1, 0] started executing at cycle 2.
>   - Instruction [1, 0] reached the write back stage at cycle 4.
>   - Instruction [1, 0] was retired at cycle 10.
>
> Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't have to
> wait in
> the scheduler's queue for the operands to become available. By the time
the
> vmulps is dispatched, operands are already available, and pipeline JFPU1 is
> ready to serve another instruction.  So the instruction can be immediately
> issued on the JFPU1 pipeline. That is demonstrated by the fact that the
> instruction only spent 1cy in the scheduler's queue.
>
> There is a gap of 5 cycles between the write-back stage and the retire
> event.
> That is because instructions must retire in program order, so [1,0] has to
> wait
> for [0, 2] to be retired first (i.e it has to wait unti cycle 10).
>
> In the dot-product example, all instructions are in a RAW (Read After
> Write)
> dependency chain.  Register %xmm2 written by the vmulps is immediately
> used by
> the first vhaddps, and register %xmm3 written by the first vhaddps is used
> by
> the second vhaddps.  Long data dependencies negatively affect the ILP
> (Instruction Level Parallelism).
>
> In the dot-product example, there are anti-dependencies introduced by
> instructions from different iterations.  However, those dependencies can be
> removed at register renaming stage (at the cost of allocating register
> aliases,
> and therefore consuming temporary registers).
>
> Table "Average Wait times" helps diagnose performance issues that
are
> caused by
> the presence of long latency instructions and potentially long data
> dependencies
> which may limit the ILP.  Note that the tool by default assumes at least
> 1cy
> between the dispatch event and the issue event.
>
> When the performance is limited by data dependencies and/or long latency
> instructions, the number of cycles spent while in the "ready"
state is
> expected
> to be very small when compared with the total number of cycles spent in the
> scheduler's queue.  So the difference between the two counters is a
good
> indicator of how big of an impact data dependencies had on the execution of
> instructions.  When performance is mostly limited by the lack of hardware
> resources, the delta between the two counters is small.  However, the
> number of
> cycles spent in the queue tends to be bigger (i.e. more than 1-3cy)
> especially
> when compared with other low latency instructions.
>
> Extra statistics to further diagnose performance issues.
> --------------------------------------------------------
>
> Flag '-verbose' enables extra statistics and performance counters
for the
> dispatch logic, the reorder buffer, the retire control unit and the
> register
> file.
>
> Below is an example of verbose output generated by the tool for the
> dot-product
> example discussed in the previous sections.
>
> ///////////////////
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Dynamic Dispatch Stall Cycles:
> RAT     - Register unavailable:                      0
> RCU     - Retire tokens unavailable:                 0
> SCHEDQ  - Scheduler full:                            272
> LQ      - Load queue full:                           0
> SQ      - Store queue full:                          0
> GROUP   - Static restrictions on the dispatch group: 0
>
>
> Register Alias Table:
> Total number of mappings created: 900
> Max number of mappings used:      35
>
>
> Dispatch Logic - number of cycles where we saw N instructions dispatched:
> [# dispatched], [# cycles]
>  0,              24  (3.9%)
>  1,              272  (44.6%)
>  2,              314  (51.5%)
>
>
> Schedulers - number of cycles where we saw N instructions issued:
> [# issued], [# cycles]
>  0,          7  (1.1%)
>  1,          306  (50.2%)
>  2,          297  (48.7%)
>
>
> Retire Control Unit - number of cycles where we saw N instructions retired:
> [# retired], [# cycles]
>  0,           109  (17.9%)
>  1,           102  (16.7%)
>  2,           399  (65.4%)
>
>
> Scheduler's queue usage:
> JALU01,  0/20
> JFPU01,  18/18
> JLSAGU,  0/12
> ///////////////////
>
> Based on the verbose report, the backend was only able to dispatch two
> instructions 51.5% of the time.  The dispatch group was limited to one
> instruction 44.6% of the cycles, which corresponds to 272 cycles.
>
> If we look at section "Dynamic Dispatch Stall Cycles", we can see
how
> counter
> SCHEDQ reports 272 cycles.  Counter SCHEDQ is incremented every time the
> dispatch
> logic is unable to dispatch a full group of two instructions because the
> scheduler's queue is full.
>
> Section "Scheduler's queue usage" shows how the maximum
number of buffer
> entries
> (i.e. scheduler's queue entries) used at runtime for resource JFPU01
> reached its
> maximum. Note that AMD Jaguar implements three schedulers:
>   * JALU01 - A scheduler for ALU instructions
>   * JLSAGU - A scheduler for address generation
>   * JFPU01 - A scheduler floating point operations.
>
> The dot-product is a kernel of three floating point instructions (a vector
> multiply followed by two horizontal adds).  That explains why only the
> floating
> point scheduler appears to be used according to section
"Scheduler's queue
> usage".
>
> A full scheduler's queue is either caused by data dependency chains, or
by
> a
> sub-optimal usage of hardware resources.  Sometimes, resource pressure can
> be
> mitigated by rewriting the kernel using different instructions that consume
> different scheduler resources.  Schedulers with a small queue are less
> resilient
> to bottlenecks caused by the presence of long data dependencies.
>
> In this example, we can conclude that the IPC is mostly limited by data
> dependencies, and not by resource pressure.
>
> LLVM-MCA instruction flow
> -------------------------
>
> This section describes the instruction flow through the out-of-order
> backend, as
> well as the functional units involved in the process.
>
> An instruction goes through a default sequence of stages:
>     - Dispatch (Instruction is dispatched to the schedulers).
>     - Issue (Instruction is issued to the processor pipelines).
>     - Write Back (Instruction is executed, and results are written back).
>     - Retire (Instruction is retired; writes are architecturally
> committed).
>
> The tool only models the out-of-order portion of a processor. Therefore,
> the
> instruction fetch and decode stages are not modeled. Performance
> bottlenecks in
> the frontend are not diagnosed by this tool.  The tool assumes that
> instructions
> have all been decoded and placed in a queue.  Also, the tool doesn't
know
> anything about branch prediction.
>
> Instruction Dispatch
> --------------------
>
> During the Dispatch stage, instructions are picked in program order from a
> queue
> of already decoded instructions, and dispatched in groups to the hardware
> schedulers.  The dispatch logic is implemented by class DispatchUnit in
> file
> Dispatch.h.
>
> The size of a dispatch group depends on the availability of hardware
> resources,
> and it cannot exceed the value of field 'DispatchWidth' in class
> DispatchUnit.
> Note that field DispatchWidth defaults to the value of field
'IssueWidth'
> from
> the scheduling model.
>
> Users can override the DispatchWidth value with flag
"-dispatch=<N>"
> (where 'N'
> is an unsigned quantity).
>
> An instruction can be dispatched if:
>  - The size of the dispatch group is smaller than DispatchWidth
>  - There are enough entries in the reorder buffer
>  - There are enough temporary registers to do register renaming
>  - Schedulers are not full.
>
> Scheduling models don't describe register files, and therefore the tool
> doesn't
> know if there is more than one register file, and how many temporaries are
> available for register renaming.
>
> By default, the tool (optimistically) assumes a single register file with
> an
> unbounded number of temporary registers.  Users can limit the number of
> temporary registers available for register renaming using flag
> `-register-file-size=<N>`, where N is the number of temporaries.  A
value
> of
> zero for N means 'unbounded'.  Knowing how many temporaries are
available
> for
> register renaming, the tool can predict dispatch stalls caused by the lack
> of
> temporaries.
>
> The number of reorder buffer entries consumed by an instruction depends on
> the
> number of micro-opcodes it specifies in the target scheduling model (see
> field
> 'NumMicroOpcodes' of tablegen class ProcWriteResources and its
derived
> classes;
> TargetSchedule.td).
>
> The reorder buffer is implemented by class RetireControlUnit (see
> Dispatch.h).
> Its goal is to track the progress of instructions that are
"in-flight", and
> retire instructions in program order.  The number of entries in the reorder
> buffer defaults to the value of field 'MicroOpBufferSize' from the
target
> scheduling model.
>
> Instructions that are dispatched to the schedulers consume scheduler buffer
> entries.  The tool queries the scheduling model to figure out the set of
> buffered resources consumed by an instruction.  Buffered resources are
> treated
> like "scheduler" resources, and the field 'BufferSize'
(from the processor
> resource tablegen definition) defines the size of the scheduler's
queue.
>
> Zero latency instructions (for example NOP instructions) don't consume
> scheduler
> resources.  However, those instructions still reserve a number of slots in
> the
> reorder buffer.
>
> Instruction Issue
> -----------------
>
> As mentioned in the previous section, each scheduler resource implements a
> queue
> of instructions.  An instruction has to wait in the scheduler's queue
until
> input register operands become available.  Only at that point, does the
> instruction becomes eligible for execution and may be issued (potentially
> out-of-order) to a pipeline for execution.
>
> Instruction latencies can be computed by the tool with the help of the
> scheduling model; latency values are defined by the scheduling model
> through
> ProcWriteResources objects.
>
> Class Scheduler (see file Scheduler.h) knows how to emulate multiple
> processor
> schedulers.  A Scheduler is responsible for tracking data dependencies, and
> dynamically select which processor resources are consumed/used by
> instructions.
>
> Internally, the Scheduler class delegates the management of processor
> resource
> units and resource groups to the ResourceManager class.  ResourceManager
> is also
> responsible for selecting resource units that are effectively consumed by
> instructions.  For example, if an instruction consumes 1cy of a resource
> group,
> the ResourceManager object selects one of the available units from the
> group; by
> default, it uses a round-robin selector to guarantee that resource usage is
> uniformly distributed between all units of a group.
>
> Internally, class Scheduler implements three instruction queues:
>   - WaitQueue: a queue of instructions whose operands are not ready yet.
>   - ReadyQueue: a queue of instructions ready to execute.
>   - IssuedQueue: a queue of instructions executing.
>
> Depending on the operands availability, instructions that are dispatched
> to the
> Scheduler are either placed into the WaitQueue or into the ReadyQueue.
>
> Every cycle, class Scheduler checks if instructions can be moved from the
> WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
> issued to the underlying pipelines.  The algorithm prioritizes older
> instructions over younger instructions.
>
> Objects of class ResourceState (see Scheduler.h) describe processor
> resources.
> There is an instance of class ResourceState for each single processor
> resource
> specified by the scheduling model.  A ResourceState object for a processor
> resource with multiple units dynamically tracks the availability of every
> single
> unit.  For example, the ResourceState of a resource group tracks the
> availability of every resource in that group.  Internally, ResourceState
> implements a round-robin selector to dynamically pick the next unit to use
> from
> the group.
>
> Write-Back and Retire Stage
> ---------------------------
>
> Issued instructions are moved from the ReadyQueue to the IssuedQueue.
> There,
> instructions wait until they reach the write-back stage.  At that point,
> they
> get removed from the queue and the retire control unit is notified.
>
> On the event of "instruction executed", the retire control unit
flags the
> instruction as "ready to retire".
>
> Instruction are retired in program order; an "instruction
retired" event
> is sent
> to the register file which frees the temporary registers allocated for the
> instruction at register renaming stage.
>
> Load/Store Unit and Memory Consistency Model
> --------------------------------------------
>
> The tool attempts to emulate out-of-order execution of memory operations.
> Class
> LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues
> for
> speculative execution of loads and stores.
>
> Each load (or store) consumes an entry in the load (or store) queue.  The
> number
> of slots in the load/store queues is unknown by the tool, since there is no
> mention of it in the scheduling model.  In practice, users can specify flag
> `-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in the queue
> to be
> equal to exactly N (an unsigned value). If N is zero, then the tool
> assumes an
> unbounded queue (this is the default).
>
> LSUnit implements a relaxed consistency model for memory loads and stores.
> The
> rules are:
> 1) A younger load is allowed to pass an older load only if there is no
>    intervening store in between the two loads.
> 2) An younger store is not allowed to pass an older store.
> 3) A younger store is not allowed to pass an older load.
> 4) A younger load is allowed to pass an older store provided that the load
> does
>    not alias with the store.
>
> By default, this class conservatively (i.e. pessimistically) assumes that
> loads
> always may-alias store operations.  Essentially, this LSUnit doesn't
> perform any
> sort of alias analysis to rule out cases where loads and stores don't
> overlap
> with each other.  The downside of this approach however is that younger
> loads are
> never allowed to pass older stores.  To make it possible for a younger
> load to
> pass an older store, users can use the command line flag -noalias.  Under
> 'noalias', a younger load is always allowed to pass an older store.
>
> Note that, in the case of write-combining memory, rule 2. could be relaxed
> a bit
> to allow reordering of non-aliasing store operations.  That being said, at
> the
> moment, there is no way to further relax the memory model (flag -noalias
> is the
> only option).  Essentially, there is no option to specify a different
> memory
> type (for example: write-back, write-combining, write-through; etc.) and
> consequently to weaken or strengthen the memory model.
>
> Other limitations are:
>  * LSUnit doesn't know when store-to-load forwarding may occur.
>  * LSUnit doesn't know anything about the cache hierarchy and memory
types.
>  * LSUnit doesn't know how to identify serializing operations and
memory
> fences.
>
> No assumption is made on the store buffer size.  As mentioned before,
> LSUnit
> conservatively assumes a may-alias relation between loads and stores, and
> it
> doesn't attempt to identify cases where store-to-load forwarding would
> occur in
> practice.
>
> LSUnit doesn't attempt to predict whether a load or store hits or
misses
> the L1
> cache.  It only knows if an instruction "MayLoad" and/or
"MayStore".  For
> loads,
> the scheduling model provides an "optimistic" load-to-use latency
(which
> usually
> matches the load-to-use latency for when there is a hit in the L1D).
>
> Class MCInstrDesc in LLVM doesn't know about serializing operations,
nor
> memory-barrier like instructions.  LSUnit conservatively assumes that an
> instruction which has both 'MayLoad' and
'UnmodeledSideEffects' behaves
> like a
> "soft" load-barrier.  That means, it serializes loads without
forcing a
> flush of
> the load queue.  Similarly, instructions flagged with both
'MayStore' and
> 'UnmodeledSideEffects' are treated like store barriers.  A full
memory
> barrier
> is a 'MayLoad' and 'MayStore' instruction with
'UnmodeledSideEffects'.
> This is
> inaccurate, but it is the best that we can do at the moment with the
> current
> information available in LLVM.
>
> A load/store barrier consumes one entry of the load/store queue.  A
> load/store
> barrier enforces ordering of loads/stores.  A younger load cannot pass a
> load
> barrier.  Also, a younger store cannot pass a store barrier.  A younger
> load has
> to wait for the memory/load barrier to execute.  A load/store barrier is
> "executed" when it becomes the oldest entry in the load/store
queue(s).
> That
> also means, by construction, all the older loads/stores have been executed.
>
> In conclusion the full set of rules is:
>   1. A store may not pass a previous store.
>   2. A load may not pass a previous store unless flag 'NoAlias' is
set.
>   3. A load may pass a previous load.
>   4. A store may not pass a previous load (regardless of flag
'NoAlias').
>   5. A load has to wait until an older load barrier is fully executed.
>   6. A store has to wait until an older store barrier is fully executed.
>
> Known limitations
> -----------------
> Previous sections described cases where the tool is missing information to
> give
> an accurate report.  For example, the first sections of this document
> explained
> how the lack of knowledge about the processor negatively affects the
> performance
> analysis.  The lack of knowledge is often a consequence of how scheduling
> models
> are defined; as mentioned before, scheduling models intentionally don't
> describe
> processors in fine details.
>
> The accuracy of the performance analysis is also affected by assumptions
> made by
> the processor model used by the tool.
>
> Most recent Intel and AMD processors implement dedicated
> LoopBuffer/OpCache in
> the hardware frontend to speedup the throughput in the presence of tight
> loops.
> The presence of these buffers complicates the decoding logic, and requires
> knowledge on the branch predictor too.  Class 'SchedMachineModel'
in
> tablegen
> provides a field named 'LoopMicroOpBufferSize' which is used to
describe
> loop
> buffers.  However, the purpose of that field is to enable loop unrolling of
> tight loops; essentially, it affects the cost model used by pass
> loop-unroll.
>
> By design, the tool only cares about the out-of-order portion of a
> processor,
> and consequently doesn't try to predict the frontend throughput.
> Processors may
> implement complex decoding schemes; statically predicting the frontend
> throughput is in general beyond the scope of this tool.  For the same
> reasons,
> this tool intentionally doesn't model branch prediction.  That being
said,
> this
> tool could be definitely extended in future to also account for the
> hardware
> frontend when doing performance analysis.  This would inevitably require
> extra
> (extensive) processor knowledge related to all the available decoding
> paths in
> the hardware frontend.
>
> When computing the IPC, the tool assumes a zero-latency "perfect"
> fetch&decode
> stage; the full sequence of decoded instructions is immediately visible to
> the
> dispatch logic from the start.
>
> The tool doesn't know about simultaneous mutithreading.  According to
the
> tool,
> processor resources are not statically/dynamically partitioned.  Processor
> resources are fully available to the hardware thread executing the
> microbenchmark.
>
> The execution model implemented by this tool assumes that instructions are
> firstly dispatched in groups to hardware schedulers, and then issued to
> pipelines for execution.  The model assumes dynamic scheduling of
> instructions.
> Instructions are placed in a queue and potentially executed out-of-order
> (based
> on the operand availability). The dispatch stage is definitely distinct
> from the
> issue stage.
>
> This model doesn't correctly describe processors where the
dispatch/issue
> is a
> single stage. This is what happens for example in VLIW processors, where
> instructions are packaged and statically scheduled at compile time; it is
> up to
> the compiler to predict the latency of instructions and package issue
> groups
> accordingly. For such targets, there is no dynamic scheduling done by the
> hardware.
>
> Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted
> to
> support processors with a single dispatch/issue stage. The execution flow
> would
> require some changes in the way how existing components (i.e.
> DispatchUnit,
> Scheduler, etc.) interact. This can be a future development.
>
> The following sections describes other known limitations.  The goal is not
> to
> provide an extensive list of limitations; we want to report what we
> believe are
> the most important limitations, and suggest possible methods to overcome
> them.
>
> Load/Store barrier instructions and serializing operations
> ----------------------------------------------------------
> Section "Load/Store Unit and Memory Consistency Model" already
mentioned
> how
> LLVM doesn't know about serializing operations and memory barriers. 
Most
> of it
> boils down to the fact that class MCInstrDesc (intentionally) doesn't
> expose
> those properties.  Instead, both serializing operations and memory barriers
> "have side-effects" according to MCInstrDesc.  That is because,
at least
> for
> scheduling purposes, knowing that an instruction has unmodeled side
> effects is
> often enough to treat the instruction like a compiler scheduling barrier.
>
> A performance analysis tool could use the extra knowledge on barriers and
> serializing operations to generate a more accurate performance report.
> One way
> to improve this is by reserving a couple of bits in field 'Flags'
from
> class
> MCInstrDesc: one bit for barrier operations, and another bit to mark
> instructions as serializing operations.
>
> Lack of support for instruction itineraries
> -------------------------------------------
> The current version of the tool doesn't know how to process instruction
> itineraries.  This is probably one of the most important limitations,
> since it
> affects a few out-of-order processors in LLVM.
>
> As mentioned in section 'Instruction Issue', class Scheduler
delegates to
> an
> instance of class ResourceManager the handling of processor resources.
> ResourceManager is where most of the scheduling logic is implemented.
>
> Adding support for instruction itineraries requires that we teach
> ResourceManager how to handle functional units and instruction stages.
> This
> development can be a future extension, and it would probably require a few
> changes to the ResourceManager interface.
>
> Instructions that affect control flow are not correctly modeled
> ---------------------------------------------------------------
> Examples of instructions that affect the control flow are: return, indirect
> branches, calls, etc.  The tool doesn't try to predict/evaluate branch
> targets.
> In particular, the tool doesn't model any sort of branch prediction,
nor
> does it
> attempt to track changes to the program counter.  The tool always assumes
> that
> the input assembly sequence is the body of a microbenchmark (a simple loop
> executed for a number of iterations). The "next" instruction in
sequence is
> always the next instruction to dispatch.
>
> Call instructions default to an arbitrary high latency of 100cy. A warning
> is
> generated if the tool encounters a call instruction in the sequence.
> Return
> instructions are not evaluated, and therefore control flow is not affected.
> However, the tool still queries the processor scheduling model to obtain
> latency
> information for instructions that affect the control flow.
>
> Possible extensions to the scheduling model
> -------------------------------------------
> Section "Instruction Dispatch" explained how the tool doesn't
know about
> the
> register files, and temporaries available in each register file for
> register
> renaming purposes.
>
> The LLVM scheduling model could be extended to better describe register
> files.
> Ideally, scheduling model should be able to define:
>  - The size of each register file
>  - How many temporary registers are available for register renaming
>  - How register classes map to register files
>
> The scheduling model doesn't specify the retire throughput (i.e. how
many
> instructions can be retired every cycle).  Users can specify flag
> `-max-retire-per-cycle=<uint>` to limit how many instructions the
retire
> control
> unit can retire every cycle.  Ideally, every processor should be able to
> specify
> the retire throughput (for example, by adding an extra field to the
> scheduling
> model tablegen class).
>
> Known limitations on X86 processors
> -----------------------------------
>
> 1) Partial register updates versus full register updates.
>
> On x86-64, a 32-bit GPR write fully updates the super-register. Example:
>       add %edi %eax    ## eax += edi
>
> Here, register %eax aliases the lower half of 64-bit register %rax. On
> x86-64,
> register %rax is fully updated by the 'add' (the upper half of %rax
is
> zeroed).
> Essentially, it "kills" any previous definition of (the upper
half of)
> register
> %rax.
>
> On the other hand, 8/16 bit register writes only perform a so-called
> "partial
> register update". Example:
>       add %di, %ax     ## ax += di
>
> Here, register %eax is only partially updated. To be more specific, the
> lower
> half of %eax is set, and the upper half is left unchanged. There is also no
> change in the upper 48 bits of register %rax.
>
> To get accurate performance analysis, the tool has to know which
> instructions
> perform a partial register update, and which instructions fully update the
> destination's super-register.
>
> One way to expose this information is (again) via tablegen.  For example,
> we
> could add a flag in the tablegen instruction class to tag instructions that
> perform partial register updates. Something like this: 'bit
> hasPartialRegisterUpdate = 1'. However, this would force a `let
> hasPartialRegisterUpdate = 0` on several instruction definitions.
>
> Another approach is to have a MCSubtargetInfo hook similar to this:
>     virtual bool updatesSuperRegisters(unsigned short opcode) { return
> false; }
>
> Targets will be able to override this method if needed.  Again, this is
> just an
> idea. But the plan is to have this fixed as a future development.
>
> 2) Macro Op fusion.
>
> The tool doesn't know about macro-op fusion. On modern x86 processors,
a
> 'cmp/test' followed by a 'jmp' is fused into a single macro
operation.  The
> advantage is that the fused pair only consumes a single slot in the
> dispatch
> group.
>
> As a future development, the tool should be extended to address
> macro-fusion.
> Ideally, we could have LLVM generate a table enumerating all the opcode
> pairs
> that can be fused together. That table could be exposed to the tool via the
> MCSubtargetInfo interface.  This is just an idea; there may be better ways
> to
> implement this.
>
> 3) Intel processors: mixing legacy SSE with AVX instructions.
>
> On modern Intel processors with AVX, mixing legacy SSE code with AVX code
> negatively impacts the performance.  The tool is not aware of this issue,
> and
> the performance penalty is not accounted when doing the analysis. This is
> something that we would like to improve in future.
>
> 4) Zero-latency register moves and Zero-idioms.
>
> Most modern AMD/Intel processors know how to optimize out register-register
> moves and zero idioms at register renaming stage. The tool doesn't know
> about these patterns, and this may negatively impact the performance
> analysis.
>
> Known design problems
> ---------------------
> This section describes two design issues that are currently affecting the
> tool.
> The long term plan is to "fix" these issues.
>
> 1) Variant instructions not correctly modeled.
>
> The tool doesn't know how to analyze instructions with a
"variant"
> scheduling
> class descriptor. A variant scheduling class needs to be resolved
> dynamically.
> The "actual" scheduling class often depends on the subtarget, as
well as
> properties of the specific MachineInstr object.
>
> Unfortunately, the tool manipulates MCInst, and it doesn't know
anything
> about
> MachineInstr. As a consequence, the tool cannot use the existing machine
> subtarget hooks that are normally used to resolve the variant scheduling
> class.
> This is a major design issue which mostly affects ARM/AArch64 targets.  It
> mostly boils down to the fact that the existing scheduling framework was
> meant
> to work for MachineInstr.
>
> When the tool encounters a "variant" instruction, it assumes a
generic 1cy
> latency. However, the tool would not be able to tell which processor
> resources
> are effectively consumed by the variant instruction.
>
> 2) MCInst and MCInstrDesc.
>
> Performance analysis tools require data dependency information to correctly
> predict the runtime performance of the code. This tool must always be able
> to
> obtain the set of implicit/explicit register defs/uses for every
> instruction of
> the input assembly sequence.
>
> In the first section of this document, it was mentioned how the tool takes
> as
> input an assembly sequence. That sequence is parsed into a MCInst sequence
> with
> the help of assembly parsers available from the targets.
>
> A MCInst is a very low-level instruction representation. The tool can
> inspect
> the MCOperand sequence of an MCInst to identify register operands. However,
> there is no way to tell register operands that are definitions from
> register
> operands that are uses.
>
> In LLVM, class MCInstrDesc is used to fully describe target instructions
> and
> their operands. The opcode of a machine instruction (a MachineInstr
> object) can
> be used to query the instruction set through method `MCInstrInfo::get'
to
> obtain
> the associated MCInstrDesc object.
>
> However class MCInstrDesc describes properties and operands of MachineInstr
> objects. Essentially, MCInstrDesc is not meant to be used to describe
> MCInst
> objects.  To be more specific, MCInstrDesc objects are automatically
> generated
> via tablegen from the instruction set description in the target .td
> files.  For
> example, field `MCInstrDesc::NumDefs' is always equal to the
cardinality
> of the
> `(outs)` set from the tablegen instruction definition.
>
> By construction, register definitions always appear at the beginning of the
> MachineOperands list in MachineInstr. Basically, the (outs) are the first
> operands of a MachineInstr, and the (ins) will come after in the machine
> operand
> list. Knowing the number of register definitions is enough to identify
> all the register operands that are definitions.
>
> In a normal compilation process, MCInst objects are generated from
> MachineInstr
> objects through a lowering step. By default the lowering logic simply
> iterates
> over the machine operands of a MachineInstr, and converts/expands them into
> equivalent MCOperand objects.
>
> The default lowering strategy has the advantage of preserving all of the
> above mentioned assumptions on the machine operand sequence. That means,
> register
> definitions would still be at the beginning of the MCOperand sequence, and
> register uses would come after.
>
> Targets may still define custom lowering routines for specific opcodes.
> Some of
> these routines may lower operands in a way that potentially breaks (some
> of) the
> assumptions on the machine operand sequence which were valid for
> MachineInstr.
> Luckily, this is not the most common form of lowering done by the targets,
> and
> the vast majority of the MachineInstr are lowered based on the default
> strategy
> which preserves the original machine operand sequence.  This is especially
> true
> for x86, where the custom lowering logic always preserves the original
> (i.e.
> from the MachineInstr) operand sequence.
>
> This tool currently works under the strong (and potentially incorrect)
> assumption that register def/uses in a MCInst can always be identified by
> querying the machine instruction descriptor for the opcode. This
> assumption made
> it possible to develop this tool and get good numbers at least for the
> processors available in the x86 backend.
>
> That being said, the analysis is still potentially incorrect for other
> targets.
> So we plan (with the help of the community) to find a proper mechanism to
> map
> when possible MCOperand indices back to MachineOperand indices of the
> equivalent
> MachineInstr.  This would be equivalent to describing changes made by the
> lowering step which affected the operand sequence. For example, we could
> have an
> index for every register MCOperand (or -1, if the operand didn't exist
in
> the
> original MachineInstr). The mapping could look like this <0,1,3,2>. 
Here,
> MCOperand #2 was obtained from the lowering of MachineOperand #3. etc.
>
> This information could be automatically generated via tablegen for all the
> instructions whose custom lowering step breaks assumptions made by the
> tool on
> the register operand sequence (In general, these instructions should be the
> minority of a target's instruction set). Unfortunately, we don't
have that
> information now.  As a consequence, we assume that the number of explicit
> register definitions is the same number specified in MCInstrDesc.  We also
> assume that register definitions always come first in the operand sequence.
>
> In conclusion: these are for now the strong assumptions made by the tool:
>   * The number of explicit and implicit register definitions in a MCInst
>     matches the number of explicit and implicit definitions specified by
> the
>     MCInstrDesc object.
>   * Register uses always come after register definitions.
>   * If an opcode specifies an optional definition, then the optional
>     definition is always the last register operand in the sequence.
>
> Note that some of the information accessible from the MCInstrDesc is always
> valid for MCInst. For example: implicit register defs, implicit register
> uses
> and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties
still
> apply to
> MCInst. The tool knows about this, and uses that information during its
> analysis.
>
> What to do next
> ---------------
> The source code has been uploaded for review on phabricator at this link:
> https://reviews.llvm.org/D43951.
>
> The review covers two patches:
> A first (very small) patch that always enables the generation of processor
> resource names in the SubtargetEmitter. Currently, the processor resource
> names
> are only generated for debugging purposes, but are needed by the tool to
> generate user friendly reports, so we would like to always generate them.
> A second patch with the actual static analysis tool (in llvm/tools).
>
> Once these first two patches are committed, the plan is to keep working on
> the
> tool with the help of the community to address all of the limitations
> described
> by the previous sections, and find good solutions/fixes for the design
> issues
> described by section "Known design problems".
>
> We hope the community will find this tool useful like we have.
>
> Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell who
> really
> helped me a lot by suggesting improvements and testing the tool.
>
> Thanks for your time.
> -Andrea
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180302/2745fb40/attachment-0001.html>

Andrea Di Biagio via llvm-dev

2018-Mar-02 14:42 UTC

head link

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

Hi Andrew,

Thanks for the feedback!

On Fri, Mar 2, 2018 at 1:16 AM, Andrew Trick <atrick at apple.com> wrote:
>
> On Mar 1, 2018, at 9:22 AM, Andrea Di Biagio <andrea.dibiagio at
gmail.com>
> wrote:
>
> Hi all,
>
> At Sony we developed an LLVM based performance analysis tool named
> llvm-mca. We
> currently use it internally to statically measure the performance of code,
> and
> to help triage potential problems with target scheduling models.  We
> decided to
> post this RFC because we are interested in the feedback from the
> community, and
> we also believe that other people might be interested in a tool like this.
>
>
> This is a dream come true!
>
> llvm-mca uses information which is already available in LLVM (e.g.
> scheduling
> models) to statically measure the performance of machine code in a
> specific cpu.
> Performance is measured in terms of throughput as well as processor
> resource
> consumption.  The tool currently works for processors with an out-of-order
> backend, for which there is a scheduling model available in LLVM.
>
> The main goal of this tool is not just to predict the performance of the
> code
> when run on the target, but also help with diagnosing potential performance
> issues.
>
>
> You can’t really have one without the other, which is why this is a dream
> come true.
>
> Given an assembly code sequence, llvm-mca estimates the IPC (instructions
> per
> cycle), as well as hardware resources pressure. The analysis and reporting
> style
> were inspired by the IACA tool from Intel.
>
> The presence of long data dependency chains, as well as poor usage of
> hardware
> resources may lead to bottlenecks in the back-end.  The tool is able to
> generate
> a detailed report which should help with identifying and analyzing sources
> of
> bottlenecks.
>
> Scheduling models are mostly used to compute instruction latencies, to
> obtain
> read-advance information, and understand how processor resources are used
> by
> instructions.  By design, the quality of the performance analysis
> conducted by
> the tool is inevitably affected by the quality of the target scheduling
> models.
>
> However, scheduling models intentionally do not describe all processors
> details,
> since the goal is just to enable the scheduling of machine instructions
> during
> compilation. That means, there are processor details which are not
> important for
> the purpose of scheduling instructions (and therefore not described by the
> scheduling model), but are very important for this tool.
>
>
> The LLVM machine model can have as much detail as we want, as long as it’s
> opt-in for targets.
> We’ve always had more detail than the generic scheduler actually needed.
> e.g. the scheduler doesn’t currently care how big the per-resource buffer
> sizes are.
> Targets have their own scheduling strategies that can pick and choose.
>
> At one point I wrote code that could be called by the scheduler to
> simulate OOO execution at much greater detail, but the benefit didn’t
> justify the complexity and compile time. I always felt that a static
> analysis tool was the right place for this kind of simulation.
>
I agree. I am pretty confident that all the extra details can become opt-in
for targets.

> A few examples of details that are missing in scheduling models are:
>  - Maximum number of instructions retired per cycle.
>
>
> MicroOpBufferSize is presumed to cover register renaming and retirement,
> assuming they are well-balanced. For your tool, you certainly want to be
> more precise.
>
Yes. The long term goal is to have specific (optional) fields for targets
that want to specify a different value. MicroOpBufferSize could still be
used as the default value in the absence of extra information.

>  - Actual dispatch width (it often differs from the issue width).
>
>
> This was always a hard one to generalize in a machine independent way, and
> half the world seems to swap the meaning of those terms. If you’re going to
> make this distinction please define the terms clearly in the machine model
> and explain how tools are expected to use the information.
>
> Currently IssueWidth is used to tell the scheduler that ’N' microops
will
> definitely take ’N’ / ‘IssueWidth’ cycles regardless of which functional
> units or dispatch pipeline is involved.
>
> [Reading below, I saw that you define instruction “dispatch" as what
the
> LLVM machine model calls “issue”. LLVM doesn’t model dispatch directly
> because, by its definition, it is redundant with the number of function
> units, modulo some dynamic behavior that can’t be statically predicted
> anyway]
>
> We also don’t model decoding, because, presumably you hit the micro-op
> limit first.
>
>  - Number of temporary registers available for renaming.
>
>
> See MicroOpBufferSize above.
>
Right. This can be another detail that targets could expose.

>  - Number of read/write ports in the register file(s).
>
>
> The assumption is that we hit micro-op issue width first. I suppose it’s
> good to have though.
>
>  - Length of the load/store queue in the LSUnit.
>
>
> That was supposed to be covered by per-processor-resource buffer size.
> Maybe you want to simulate the load/store queue differently from other
> functional units? e.g. one shared queue across multiple load store units?
>
Yes. To start, I'd like to be able to have a unified LSUnit. In future,
this design could be improved.

> It is also very difficult to find a "good" abstract model to
describe the
> behavior of out-of-order processors. So, we have to keep in mind that all
> of
> these aspects are going to affect the quality of the static analysis
> performed
> by the tool.
>
>
> Like the scheduler, the tool should be extensible so that different
> targets can simulate behavior that doesn’t fit some abstract model.
>
This has been mentioned by Clement too. The long term goal is to make it
possible for targets to specify what model they want. As Clement pointed
out in the code review, this will require a few inital changes to make the
tool more modular in preparation for it. I suggested in the review if it is
possible to use the current design as a baseline, and improve it
incrementally with later patches. Essentially, commit the current baseline
design and then start changing/improving it with other patches would make
it easier for others to contribute their own patches.
>
> An extensive list of known limitations is reported in one of the last
> sections
> of this document. There is also a section related to design problems which
> must
> be addressed (hopefully with the help of the community).  At the moment,
> the
> tool has been mostly tested for x86 targets, but there are still several
> limitations, some of which could be overcome by integrating extra
> information
> into the scheduling models.
>
> As was mentioned before, this tool has been (and is still being) used
> internally
> in Sony to debug/triage issues in the btver2 scheduling model. We have also
> tested it on other targets to check how generic the tool is. In our
> experience,
> the tool makes it easy to identify simple mistakes like "wrong number
of
> micro
> opcodes specified for an instruction", or "wrong set of hardware
> resources".
> Some of these mistakes are quite common (sometimes on mature models too),
> and
> often difficult to catch.  Reports generated by this tool are simple to
> analyze,
> and contain enough details to help triage most performance problems.
>
> 1. How the tool works
> ——————————
>
>
> Nice documentation. Please find a good place for it to live and include it
> in your patch.
>
I can copy/paste this RFC into a README.txt (removing the initial paragraph
and probably the conclusions). :-)

> <snip>
>
>
> Timeline View
> -------------
>
> A detailed report of each instruction's state transitions over time can
be
> enabled using the command line flag '-timeline'.  This prints an
extra
> section
> in the report which contains the so-called "timeline view". 
Below is the
> timeline view for the dot-product example from the previous section.
>
> ///////////////
> Timeline view:
>                    012345
> Index    0123456789
>
> [0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2
> [0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3
> [0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2
> [1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3
> [1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2
> [2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3
> [2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Truly awesome.
>
Thanks. :-)

>
>
>
> Extra statistics to further diagnose performance issues.
> --------------------------------------------------------
>
> Flag '-verbose' enables extra statistics and performance counters
for the
> dispatch logic, the reorder buffer, the retire control unit and the
> register
> file.
>
> Below is an example of verbose output generated by the tool for the
> dot-product
> example discussed in the previous sections.
>
> ///////////////////
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Dynamic Dispatch Stall Cycles:
> RAT     - Register unavailable:                      0
> RCU     - Retire tokens unavailable:                 0
> SCHEDQ  - Scheduler full:                            272
> LQ      - Load queue full:                           0
> SQ      - Store queue full:                          0
> GROUP   - Static restrictions on the dispatch group: 0
>
>
> Register Alias Table:
> Total number of mappings created: 900
> Max number of mappings used:      35
>
>
> Dispatch Logic - number of cycles where we saw N instructions dispatched:
> [# dispatched], [# cycles]
>  0,              24  (3.9%)
>  1,              272  (44.6%)
>  2,              314  (51.5%)
>
>
> Schedulers - number of cycles where we saw N instructions issued:
> [# issued], [# cycles]
>  0,          7  (1.1%)
>  1,          306  (50.2%)
>  2,          297  (48.7%)
>
>
> Retire Control Unit - number of cycles where we saw N instructions retired:
> [# retired], [# cycles]
>  0,           109  (17.9%)
>  1,           102  (16.7%)
>  2,           399  (65.4%)
>
>
> Scheduler's queue usage:
> JALU01,  0/20
> JFPU01,  18/18
> JLSAGU,  0/12
> ///////////////////
>
>
> So great!
>
> LLVM-MCA instruction flow
> -------------------------
>
> This section describes the instruction flow through the out-of-order
> backend, as
> well as the functional units involved in the process.
>
> An instruction goes through a default sequence of stages:
>     - Dispatch (Instruction is dispatched to the schedulers).
>     - Issue (Instruction is issued to the processor pipelines).
>     - Write Back (Instruction is executed, and results are written back).
>     - Retire (Instruction is retired; writes are architecturally
> committed).
>
> The tool only models the out-of-order portion of a processor. Therefore,
> the
> instruction fetch and decode stages are not modeled. Performance
> bottlenecks in
> the frontend are not diagnosed by this tool.  The tool assumes that
> instructions
> have all been decoded and placed in a queue.  Also, the tool doesn't
know
> anything about branch prediction.
>
>
> The following sections are fantastic documentation for your tool and the
> machine model in general. I hope you check all this in somewhere.
>
> Instruction Dispatch
> --------------------
>
> During the Dispatch stage, instructions are picked in program order from a
> queue
> of already decoded instructions, and dispatched in groups to the hardware
> schedulers.  The dispatch logic is implemented by class DispatchUnit in
> file
> Dispatch.h.
>
> The size of a dispatch group depends on the availability of hardware
> resources,
> and it cannot exceed the value of field 'DispatchWidth' in class
> DispatchUnit.
> Note that field DispatchWidth defaults to the value of field
'IssueWidth'
> from
> the scheduling model.
>
> Users can override the DispatchWidth value with flag
"-dispatch=<N>"
> (where 'N'
> is an unsigned quantity).
>
> An instruction can be dispatched if:
>  - The size of the dispatch group is smaller than DispatchWidth
>  - There are enough entries in the reorder buffer
>  - There are enough temporary registers to do register renaming
>  - Schedulers are not full.
>
>
> This is really what’s meant by LLVM’s IssueWidth. I think Intel always
> preferred to call it “dispatch”. Presumably, instructions are buffered
> before being dispatched/issued to functional units, so LLVM’s “IssueWidth”
> is meant to model a hardware restriction that is independent from the
> number of functional units (processor resources).
>
> It’s especially confusing because some µArchs have dispatch pipelines
> decoupled from the functional units.
>
I know... I guess, as long as we standardize the terminology, things should
be fine.

> Scheduling models don't describe register files, and therefore the tool
> doesn't
> know if there is more than one register file, and how many temporaries are
> available for register renaming.
>
>
> The LLVM machine model can model the size of the register renaming pool if
> you like.
>
> By default, the tool (optimistically) assumes a single register file with
> an
> unbounded number of temporary registers.  Users can limit the number of
> temporary registers available for register renaming using flag
> `-register-file-size=<N>`, where N is the number of temporaries.  A
value
> of
> zero for N means 'unbounded'.  Knowing how many temporaries are
available
> for
> register renaming, the tool can predict dispatch stalls caused by the lack
> of
> temporaries.
>
> The number of reorder buffer entries consumed by an instruction depends on
> the
> number of micro-opcodes it specifies in the target scheduling model (see
> field
> 'NumMicroOpcodes' of tablegen class ProcWriteResources and its
derived
> classes;
> TargetSchedule.td).
>
> The reorder buffer is implemented by class RetireControlUnit (see
> Dispatch.h).
> Its goal is to track the progress of instructions that are
"in-flight", and
> retire instructions in program order.  The number of entries in the reorder
> buffer defaults to the value of field 'MicroOpBufferSize' from the
target
> scheduling model.
>
> Instructions that are dispatched to the schedulers consume scheduler buffer
> entries.  The tool queries the scheduling model to figure out the set of
> buffered resources consumed by an instruction.  Buffered resources are
> treated
> like "scheduler" resources, and the field 'BufferSize'
(from the processor
> resource tablegen definition) defines the size of the scheduler's
queue.
>
> Zero latency instructions (for example NOP instructions) don't consume
> scheduler
> resources.  However, those instructions still reserve a number of slots in
> the
> reorder buffer.
>
>
> As currently modeled by dummy µOps:
>
> // A single nop micro-op (uX).
> def WriteX : SchedWriteRes<[]> { let Latency = 0; }
>
> You’ll need MachineInstr’s though.
>
> Instruction Issue
> -----------------
>
> As mentioned in the previous section, each scheduler resource implements a
> queue
> of instructions.  An instruction has to wait in the scheduler's queue
until
> input register operands become available.  Only at that point, does the
> instruction becomes eligible for execution and may be issued (potentially
> out-of-order) to a pipeline for execution.
>
> Instruction latencies can be computed by the tool with the help of the
> scheduling model; latency values are defined by the scheduling model
> through
> ProcWriteResources objects.
>
> Class Scheduler (see file Scheduler.h) knows how to emulate multiple
> processor
> schedulers.  A Scheduler is responsible for tracking data dependencies, and
> dynamically select which processor resources are consumed/used by
> instructions.
>
> Internally, the Scheduler class delegates the management of processor
> resource
> units and resource groups to the ResourceManager class.  ResourceManager
> is also
> responsible for selecting resource units that are effectively consumed by
> instructions.  For example, if an instruction consumes 1cy of a resource
> group,
> the ResourceManager object selects one of the available units from the
> group; by
> default, it uses a round-robin selector to guarantee that resource usage is
> uniformly distributed between all units of a group.
>
>
> To be a cross-subtarget tool, this needs to be a customization point. It’s
> not always so simple.
>
I agree. When I designed this logic, this point in particular was a bit
problematic. I have been thinking the same; it would be nice if scheduler
resource were able to specify how units are dynamically selected. We could
still have a default round_robin strategy in the absence of extra
information. This is definitely something that we should be looking into as
a future development. If you agree, at least to start, we could use the
roundrobin strategy.

> <snip>
>
>
> Load/Store Unit and Memory Consistency Model
> --------------------------------------------
>
> The tool attempts to emulate out-of-order execution of memory operations.
> Class
> LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues
> for
> speculative execution of loads and stores.
>
> Each load (or store) consumes an entry in the load (or store) queue.  The
> number
> of slots in the load/store queues is unknown by the tool, since there is no
> mention of it in the scheduling model.  In practice, users can specify flag
> `-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in the queue
> to be
> equal to exactly N (an unsigned value). If N is zero, then the tool
> assumes an
> unbounded queue (this is the default).
>
> LSUnit implements a relaxed consistency model for memory loads and stores.
> The
> rules are:
> 1) A younger load is allowed to pass an older load only if there is no
>    intervening store in between the two loads.
> 2) An younger store is not allowed to pass an older store.
> 3) A younger store is not allowed to pass an older load.
> 4) A younger load is allowed to pass an older store provided that the load
> does
>    not alias with the store.
>
> By default, this class conservatively (i.e. pessimistically) assumes that
> loads
> always may-alias store operations.  Essentially, this LSUnit doesn't
> perform any
> sort of alias analysis to rule out cases where loads and stores don't
> overlap
> with each other.  The downside of this approach however is that younger
> loads are
> never allowed to pass older stores.  To make it possible for a younger
> load to
> pass an older store, users can use the command line flag -noalias.  Under
> 'noalias', a younger load is always allowed to pass an older store.
>
>
> I’m surprised it isn’t optimistic by default. Being pessimistic hides
> other bottlenecks and seems less accurate. I guess consistency with other
> similar tools (IACA) should be the aim.
>
I am not sure about what IACA does in this case.
In case, it is easy to change the behavior optimistic.

>
> Note that, in the case of write-combining memory, rule 2. could be relaxed
> a bit
> to allow reordering of non-aliasing store operations.  That being said, at
> the
> moment, there is no way to further relax the memory model (flag -noalias
> is the
> only option).  Essentially, there is no option to specify a different
> memory
> type (for example: write-back, write-combining, write-through; etc.) and
> consequently to weaken or strengthen the memory model.
>
> Other limitations are:
>  * LSUnit doesn't know when store-to-load forwarding may occur.
>  * LSUnit doesn't know anything about the cache hierarchy and memory
types.
>  * LSUnit doesn't know how to identify serializing operations and
memory
> fences.
>
>
> Combining the static model with a sampled dynamic hardware event profile
> would be amazing.
>
> No assumption is made on the store buffer size.  As mentioned before,
> LSUnit
> conservatively assumes a may-alias relation between loads and stores, and
> it
> doesn't attempt to identify cases where store-to-load forwarding would
> occur in
> practice.
>
>
> The LSUnits have a buffer size of course. The question is whether you
> really need to separately model issuing the instructions to a separate
> memory pipeline via a shared queue.
>
> LSUnit doesn't attempt to predict whether a load or store hits or
misses
> the L1
> cache.  It only knows if an instruction "MayLoad" and/or
"MayStore".  For
> loads,
> the scheduling model provides an "optimistic" load-to-use latency
(which
> usually
> matches the load-to-use latency for when there is a hit in the L1D).
>
>
> You’re optimistic here, which is good, but pessimistic with aliasing.
>
> Class MCInstrDesc in LLVM doesn't know about serializing operations,
nor
> memory-barrier like instructions.  LSUnit conservatively assumes that an
> instruction which has both 'MayLoad' and
'UnmodeledSideEffects' behaves
> like a
> "soft" load-barrier.  That means, it serializes loads without
forcing a
> flush of
> the load queue.  Similarly, instructions flagged with both
'MayStore' and
> 'UnmodeledSideEffects' are treated like store barriers.  A full
memory
> barrier
> is a 'MayLoad' and 'MayStore' instruction with
'UnmodeledSideEffects'.
> This is
> inaccurate, but it is the best that we can do at the moment with the
> current
> information available in LLVM.
>
>
> LLVM *should* have this information. It needs some design work though.
>
I agree.

>
> A load/store barrier consumes one entry of the load/store queue.  A
> load/store
> barrier enforces ordering of loads/stores.  A younger load cannot pass a
> load
> barrier.  Also, a younger store cannot pass a store barrier.  A younger
> load has
> to wait for the memory/load barrier to execute.  A load/store barrier is
> "executed" when it becomes the oldest entry in the load/store
queue(s).
> That
> also means, by construction, all the older loads/stores have been executed.
>
> In conclusion the full set of rules is:
>   1. A store may not pass a previous store.
>   2. A load may not pass a previous store unless flag 'NoAlias' is
set.
>   3. A load may pass a previous load.
>   4. A store may not pass a previous load (regardless of flag
'NoAlias').
>   5. A load has to wait until an older load barrier is fully executed.
>   6. A store has to wait until an older store barrier is fully executed.
>
> Known limitations
> -----------------
> Previous sections described cases where the tool is missing information to
> give
> an accurate report.  For example, the first sections of this document
> explained
> how the lack of knowledge about the processor negatively affects the
> performance
> analysis.  The lack of knowledge is often a consequence of how scheduling
> models
> are defined; as mentioned before, scheduling models intentionally don't
> describe
> processors in fine details.
>
>
> LLVM’s machine model should be optionally extended to model whatever a
> static analysis tool needs. There’s some dynamic behavior that can’t be
> modeled statically—predictive structures, the state of the pipeline
> entering a  loop—but machine model precision shouldn’t be the limiting
> factor for your tool.
>
> The accuracy of the performance analysis is also affected by assumptions
> made by
> the processor model used by the tool.
>
> Most recent Intel and AMD processors implement dedicated
> LoopBuffer/OpCache in
> the hardware frontend to speedup the throughput in the presence of tight
> loops.
> The presence of these buffers complicates the decoding logic, and requires
> knowledge on the branch predictor too.  Class 'SchedMachineModel'
in
> tablegen
> provides a field named 'LoopMicroOpBufferSize' which is used to
describe
> loop
> buffers.  However, the purpose of that field is to enable loop unrolling of
> tight loops; essentially, it affects the cost model used by pass
> loop-unroll.
>
> By design, the tool only cares about the out-of-order portion of a
> processor,
> and consequently doesn't try to predict the frontend throughput.
> Processors may
> implement complex decoding schemes; statically predicting the frontend
> throughput is in general beyond the scope of this tool.  For the same
> reasons,
> this tool intentionally doesn't model branch prediction.  That being
said,
> this
> tool could be definitely extended in future to also account for the
> hardware
> frontend when doing performance analysis.  This would inevitably require
> extra
> (extensive) processor knowledge related to all the available decoding
> paths in
> the hardware frontend.
>
>
> If loops are ever definitely limited by decoder or fetch throughput, it
> would be good to know that. You don’t need to model the predictive aspect
> of it, or “all the paths”.
>
> You might as well say you don’t need to model issue/dispatch width, or
> retirement logic, because it’s decoupled from OOO execution via instruction
> buffers. The point of this tool is to tell you that there is definitely a
> bottleneck in a particular area of the pipeline.
>
> Would it be more appropriate to say that a simple abstract model of
> decoding is too inaccurate for your particular subtarget so there was not
> enough benefit to implementing it?
>
Ideally, we don't want to exclude the possibility to analyze the frontend
performance. In future, the tool should be extended to account for the
frontend too. Even Simon Pilgrim suggested (in a private conversation) how
it would be very useful to have at least a few information about the
frontend too.
Clement mentioned in his recent post on this thread that his team at Google
implemented a similar tool, and their tool analyzes the frontend too. I am
going to reply to his post after this. But the bottom line is that it would
be great if people contribute the frontend analysis to this tool.

> When computing the IPC, the tool assumes a zero-latency "perfect"
> fetch&decode
> stage; the full sequence of decoded instructions is immediately visible to
> the
> dispatch logic from the start.
>
> The tool doesn't know about simultaneous mutithreading.  According to
the
> tool,
> processor resources are not statically/dynamically partitioned.  Processor
> resources are fully available to the hardware thread executing the
> microbenchmark.
>
> The execution model implemented by this tool assumes that instructions are
> firstly dispatched in groups to hardware schedulers, and then issued to
> pipelines for execution.  The model assumes dynamic scheduling of
> instructions.
> Instructions are placed in a queue and potentially executed out-of-order
> (based
> on the operand availability). The dispatch stage is definitely distinct
> from the
> issue stage.
>
>
> I wonder why in-order analysis doesn’t mostly fall out as a degenerate
> case in your tool. There’s some grey area where in-order processors
> hardware interlocks that cause stalls that aren’t explicit in the
> scheduling groups. Those processors would still benefit from your tool.
> Even if that target doesn’t benefit from the OOO simulation, it would be
> nice to compare the output of your tool with the observed performance to
> find bugs in the model.
>
> This model doesn't correctly describe processors where the
dispatch/issue
> is a
> single stage. This is what happens for example in VLIW processors, where
> instructions are packaged and statically scheduled at compile time; it is
> up to
> the compiler to predict the latency of instructions and package issue
> groups
> accordingly. For such targets, there is no dynamic scheduling done by the
> hardware.
>
> Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted
> to
> support processors with a single dispatch/issue stage. The execution flow
> would
> require some changes in the way how existing components (i.e.
> DispatchUnit,
> Scheduler, etc.) interact. This can be a future development.
>
>
> Ah…. Extended with future development sounds better.
>
Yeah. Sorry if there are so many TODOs.
This is also the reason why I think it makes sense to ask help to the
community because there is still so much work to do..

> The following sections describes other known limitations.  The goal is not
> to
> provide an extensive list of limitations; we want to report what we
> believe are
> the most important limitations, and suggest possible methods to overcome
> them.
>
> Load/Store barrier instructions and serializing operations
> ----------------------------------------------------------
> Section "Load/Store Unit and Memory Consistency Model" already
mentioned
> how
> LLVM doesn't know about serializing operations and memory barriers. 
Most
> of it
> boils down to the fact that class MCInstrDesc (intentionally) doesn't
> expose
> those properties.  Instead, both serializing operations and memory barriers
> "have side-effects" according to MCInstrDesc.  That is because,
at least
> for
> scheduling purposes, knowing that an instruction has unmodeled side
> effects is
> often enough to treat the instruction like a compiler scheduling barrier.
>
> A performance analysis tool could use the extra knowledge on barriers and
> serializing operations to generate a more accurate performance report.
> One way
> to improve this is by reserving a couple of bits in field 'Flags'
from
> class
> MCInstrDesc: one bit for barrier operations, and another bit to mark
> instructions as serializing operations.
>
> Lack of support for instruction itineraries
> -------------------------------------------
> The current version of the tool doesn't know how to process instruction
> itineraries.  This is probably one of the most important limitations,
> since it
> affects a few out-of-order processors in LLVM.
>
>
> I don’t think OOO LLVM targets should be using itineraries. If those
> targets are still actively maintained, they could migrate.
>
> As mentioned in section 'Instruction Issue', class Scheduler
delegates to
> an
> instance of class ResourceManager the handling of processor resources.
> ResourceManager is where most of the scheduling logic is implemented.
>
> Adding support for instruction itineraries requires that we teach
> ResourceManager how to handle functional units and instruction stages.
> This
> development can be a future extension, and it would probably require a few
> changes to the ResourceManager interface.
>
> Instructions that affect control flow are not correctly modeled
> ---------------------------------------------------------------
> Examples of instructions that affect the control flow are: return, indirect
> branches, calls, etc.  The tool doesn't try to predict/evaluate branch
> targets.
> In particular, the tool doesn't model any sort of branch prediction,
nor
> does it
> attempt to track changes to the program counter.  The tool always assumes
> that
> the input assembly sequence is the body of a microbenchmark (a simple loop
> executed for a number of iterations). The "next" instruction in
sequence is
> always the next instruction to dispatch.
>
> Call instructions default to an arbitrary high latency of 100cy. A warning
> is
> generated if the tool encounters a call instruction in the sequence.
> Return
> instructions are not evaluated, and therefore control flow is not affected.
> However, the tool still queries the processor scheduling model to obtain
> latency
> information for instructions that affect the control flow.
>
>
> By decompiling to MachineInst’s it would be easy to build a mostly
> complete CFG and call graph.
>
> Possible extensions to the scheduling model
> -------------------------------------------
> Section "Instruction Dispatch" explained how the tool doesn't
know about
> the
> register files, and temporaries available in each register file for
> register
> renaming purposes.
>
> The LLVM scheduling model could be extended to better describe register
> files.
> Ideally, scheduling model should be able to define:
>  - The size of each register file
>  - How many temporary registers are available for register renaming
>  - How register classes map to register files
>
> The scheduling model doesn't specify the retire throughput (i.e. how
many
> instructions can be retired every cycle).  Users can specify flag
> `-max-retire-per-cycle=<uint>` to limit how many instructions the
retire
> control
> unit can retire every cycle.  Ideally, every processor should be able to
> specify
> the retire throughput (for example, by adding an extra field to the
> scheduling
> model tablegen class).
>
> Known limitations on X86 processors
> -----------------------------------
>
> 1) Partial register updates versus full register updates.
> <snip>
>
>
> MachineOperand handles this. You just need to create the machine instrs.
>
Interesting. I couldn't find how to do it. It would be great if somebody
helps me on this.

> 2) Macro Op fusion.
>
> The tool doesn't know about macro-op fusion. On modern x86 processors,
a
> 'cmp/test' followed by a 'jmp' is fused into a single macro
operation.  The
> advantage is that the fused pair only consumes a single slot in the
> dispatch
> group.
>
> As a future development, the tool should be extended to address
> macro-fusion.
> Ideally, we could have LLVM generate a table enumerating all the opcode
> pairs
> that can be fused together. That table could be exposed to the tool via the
> MCSubtargetInfo interface.  This is just an idea; there may be better ways
> to
> implement this.
>
>
> This is already expressed by subtarget code. That code just needs to be
> exposed as a common subtarget interface. Typically the interfaces take
> MachineInst, but in this case the opcode is probably sufficient.
>
> 4) Zero-latency register moves and Zero-idioms.
>
> Most modern AMD/Intel processors know how to optimize out register-register
> moves and zero idioms at register renaming stage. The tool doesn't know
> about these patterns, and this may negatively impact the performance
> analysis.
>
>
> The machine model has this, but requires proper MachineInstrs.
>
> Known design problems
> ---------------------
> This section describes two design issues that are currently affecting the
> tool.
> The long term plan is to "fix" these issues.
>
> 1) Variant instructions not correctly modeled.
>
> The tool doesn't know how to analyze instructions with a
"variant"
> scheduling
> class descriptor. A variant scheduling class needs to be resolved
> dynamically.
> The "actual" scheduling class often depends on the subtarget, as
well as
> properties of the specific MachineInstr object.
>
> Unfortunately, the tool manipulates MCInst, and it doesn't know
anything
> about
> MachineInstr. As a consequence, the tool cannot use the existing machine
> subtarget hooks that are normally used to resolve the variant scheduling
> class.
> This is a major design issue which mostly affects ARM/AArch64 targets.  It
> mostly boils down to the fact that the existing scheduling framework was
> meant
> to work for MachineInstr.
>
>
> There are good reasons for the scheduler to work with MachineInstrs, and
> any static analysis tool should work with MachineInstrs for the same
> reasons...
>
I agree. If we fix this part, then both the issues described by this
sections would disappear.

> 2) MCInst and MCInstrDesc.
>
> Performance analysis tools require data dependency information to correctly
> predict the runtime performance of the code. This tool must always be able
> to
> obtain the set of implicit/explicit register defs/uses for every
> instruction of
> the input assembly sequence.
>
> In the first section of this document, it was mentioned how the tool takes
> as
> input an assembly sequence. That sequence is parsed into a MCInst sequence
> with
> the help of assembly parsers available from the targets.
>
> A MCInst is a very low-level instruction representation. The tool can
> inspect
> the MCOperand sequence of an MCInst to identify register operands. However,
> there is no way to tell register operands that are definitions from
> register
> operands that are uses.
>
> In LLVM, class MCInstrDesc is used to fully describe target instructions
> and
> their operands. The opcode of a machine instruction (a MachineInstr
> object) can
> be used to query the instruction set through method `MCInstrInfo::get'
to
> obtain
> the associated MCInstrDesc object.
>
> However class MCInstrDesc describes properties and operands of MachineInstr
> objects. Essentially, MCInstrDesc is not meant to be used to describe
> MCInst
> objects.  To be more specific, MCInstrDesc objects are automatically
> generated
> via tablegen from the instruction set description in the target .td
> files.  For
> example, field `MCInstrDesc::NumDefs' is always equal to the
cardinality
> of the
> `(outs)` set from the tablegen instruction definition.
>
> By construction, register definitions always appear at the beginning of the
> MachineOperands list in MachineInstr. Basically, the (outs) are the first
> operands of a MachineInstr, and the (ins) will come after in the machine
> operand
> list. Knowing the number of register definitions is enough to identify
> all the register operands that are definitions.
>
> In a normal compilation process, MCInst objects are generated from
> MachineInstr
> objects through a lowering step. By default the lowering logic simply
> iterates
> over the machine operands of a MachineInstr, and converts/expands them into
> equivalent MCOperand objects.
>
> The default lowering strategy has the advantage of preserving all of the
> above mentioned assumptions on the machine operand sequence. That means,
> register
> definitions would still be at the beginning of the MCOperand sequence, and
> register uses would come after.
>
> Targets may still define custom lowering routines for specific opcodes.
> Some of
> these routines may lower operands in a way that potentially breaks (some
> of) the
> assumptions on the machine operand sequence which were valid for
> MachineInstr.
> Luckily, this is not the most common form of lowering done by the targets,
> and
> the vast majority of the MachineInstr are lowered based on the default
> strategy
> which preserves the original machine operand sequence.  This is especially
> true
> for x86, where the custom lowering logic always preserves the original
> (i.e.
> from the MachineInstr) operand sequence.
>
> This tool currently works under the strong (and potentially incorrect)
> assumption that register def/uses in a MCInst can always be identified by
> querying the machine instruction descriptor for the opcode. This
> assumption made
> it possible to develop this tool and get good numbers at least for the
> processors available in the x86 backend.
>
> That being said, the analysis is still potentially incorrect for other
> targets.
> So we plan (with the help of the community) to find a proper mechanism to
> map
> when possible MCOperand indices back to MachineOperand indices of the
> equivalent
> MachineInstr.  This would be equivalent to describing changes made by the
> lowering step which affected the operand sequence. For example, we could
> have an
> index for every register MCOperand (or -1, if the operand didn't exist
in
> the
> original MachineInstr). The mapping could look like this <0,1,3,2>. 
Here,
> MCOperand #2 was obtained from the lowering of MachineOperand #3. etc.
>
> This information could be automatically generated via tablegen for all the
> instructions whose custom lowering step breaks assumptions made by the
> tool on
> the register operand sequence (In general, these instructions should be the
> minority of a target's instruction set). Unfortunately, we don't
have that
> information now.  As a consequence, we assume that the number of explicit
> register definitions is the same number specified in MCInstrDesc.  We also
> assume that register definitions always come first in the operand sequence.
>
> In conclusion: these are for now the strong assumptions made by the tool:
>   * The number of explicit and implicit register definitions in a MCInst
>     matches the number of explicit and implicit definitions specified by
> the
>     MCInstrDesc object.
>   * Register uses always come after register definitions.
>   * If an opcode specifies an optional definition, then the optional
>     definition is always the last register operand in the sequence.
>
> Note that some of the information accessible from the MCInstrDesc is always
> valid for MCInst. For example: implicit register defs, implicit register
> uses
> and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties
still
> apply to
> MCInst. The tool knows about this, and uses that information during its
> analysis.
>
>
> You just made a very strong argument for building the MachineInstrs before
> running mca. So I wonder why you didn’t do that.
>
> What to do next
> ---------------
> The source code has been uploaded for review on phabricator at this link:
> https://reviews.llvm.org/D43951.
>
> The review covers two patches:
> A first (very small) patch that always enables the generation of processor
> resource names in the SubtargetEmitter. Currently, the processor resource
> names
> are only generated for debugging purposes, but are needed by the tool to
> generate user friendly reports, so we would like to always generate them.
> A second patch with the actual static analysis tool (in llvm/tools).
>
> Once these first two patches are committed, the plan is to keep working on
> the
> tool with the help of the community to address all of the limitations
> described
> by the previous sections, and find good solutions/fixes for the design
> issues
> described by section "Known design problems".
>
> We hope the community will find this tool useful like we have.
>
> Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell who
> really
> helped me a lot by suggesting improvements and testing the tool.
>
> Thanks for your time.
> -Andrea
>
>
> There are a number of people on llvm-dev who can explain better than I how
> to decompile into MachineInstrs. I’m not totally opposed to checking in
> something that works with MCInstr, but this does run strongly contrary to
> the design of LLVM’s subtarget support.
>
That would be great! I would be very happy if somebody suggests how to do
it (or does it for me :-)).
Do you think the current design (modulo the changes suggested in the
review) would be acceptable to start?

Thanks again for your great feedback Andy!

-Andrea

>
> -Andy
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180302/d5b4497a/attachment-0001.html>

Andrea Di Biagio via llvm-dev

2018-Mar-02 16:26 UTC

head link

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

(sending this message again. Apparently it didn't make it into llvm-dev
because the message was too long. Sorry for the spam).


Hi Clement!

On Fri, Mar 2, 2018 at 12:33 PM, Clement Courbet <courbet at google.com>
wrote:
> Hi Andrea,
>
> Thanks for this great RFC ! I've put some high-level comments here, and
> I'll give more focused comments in the review on Phabricator.
>
>Thanks!

>
> On Thu, Mar 1, 2018 at 6:22 PM, Andrea Di Biagio <
> andrea.dibiagio at gmail.com> wrote:
>
>> Hi all,
>>
>> At Sony we developed an LLVM based performance analysis tool named
>> llvm-mca. We
>> currently use it internally to statically measure the performance of
>> code, and
>> to help triage potential problems with target scheduling models.  We
>> decided to
>> post this RFC because we are interested in the feedback from the
>> community, and
>> we also believe that other people might be interested in a tool like
this.
>>
>>
> We are definitely interested in tools like this. As a matter of fact, we
> (Google) have developed a similar tool. Our approach is very similar to the
> one described here, so I’m confident that we can merge our tool with the
> one proposed here. There are some similarities and differences in the
> design, which I’ll quickly go over here. I’ll give my opinions about
> respective merits of each approach when discussing specific points below.
>
> The main differences with the design are:
>
>    -
>
>    We simulate the CPU frontend too (parse/decode). We’ve found out that
>    some kernels benefit from instruction reorderings that do not change
port
>    pressure but change how instructions are parsed and decoded (e.g. parse
>    window can be a bottleneck).
>    -
>
>    Our code is more modular on the subtarget model side. Our simulator is
>    divided in four parts:
>    -
>
>       CPU components (parser/decoder/renamer/ROB/Issue Port/...) are
>       modeled and unit tested individually. These are intended to be mostly
>       target-independant.
>       -
>
>       Each subtarget describes how to assemble these components together.
>       This is where knowledge about the target is injected.
>       -
>
>       The framework drives the components in a subtarget-agnostic way.
>       Events are logged to a `SimulationLog` object. This is similar to the
>       `HWEventListener` interface in llvm-mca.
>       -
>
>       Analysis passes analyze the `SimulationLog` to extract whatever
>       metric they care about (e.g. port pressure or IPC), or generate an
>       annotated trace. A similar functionality is provided in llvm-mca
`XXView`
>       implementations.
>
> We also have a IACA-like binary that displays analysis results.
>
> For reference, our code can be found here: https://github.com/google/EXEg
> esis/tree/master/llvm_sim
>
Thanks for the info. I will definitely look into your code.

I had a look at some of your comments in the review, and I agree that the
design can/should be improved.Our plan is also to move at one point the
analysis into a library in order to make it accessible from the compiler
(this was requested by Simon (Pilgrim) many times :-)).
It would be great if we could work together on this tool, and I like your
idea of integrating features from your tool into this one.
We could start by further modularizing the tool (as you have already
suggested in the phabricator review). We could then integrate all the
missing analysis (plus improvements) from your code into this.
Cheers,
-Andrea
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180302/1c12ba68/attachment.html>

Philip Reames via llvm-dev

2018-Mar-02 17:33 UTC

head link

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

This looks absolutely awesome.  Thank you for sharing your work; I can 
see many ways this will be useful.

Philip


On 03/01/2018 09:22 AM, Andrea Di Biagio via llvm-dev
wrote:> Hi all,
>
> At Sony we developed an LLVM based performance analysis tool named 
> llvm-mca. We
> currently use it internally to statically measure the performance of 
> code, and
> to help triage potential problems with target scheduling models.  We 
> decided to
> post this RFC because we are interested in the feedback from the 
> community, and
> we also believe that other people might be interested in a tool like this.
>
> llvm-mca uses information which is already available in LLVM (e.g. 
> scheduling
> models) to statically measure the performance of machine code in a 
> specific cpu.
> Performance is measured in terms of throughput as well as processor 
> resource
> consumption.  The tool currently works for processors with an out-of-order
> backend, for which there is a scheduling model available in LLVM.
>
> The main goal of this tool is not just to predict the performance of 
> the code
> when run on the target, but also help with diagnosing potential 
> performance
> issues.
>
> Given an assembly code sequence, llvm-mca estimates the IPC 
> (instructions per
> cycle), as well as hardware resources pressure. The analysis and 
> reporting style
> were inspired by the IACA tool from Intel.
>
> The presence of long data dependency chains, as well as poor usage of 
> hardware
> resources may lead to bottlenecks in the back-end.  The tool is able 
> to generate
> a detailed report which should help with identifying and analyzing 
> sources of
> bottlenecks.
>
> Scheduling models are mostly used to compute instruction latencies, to 
> obtain
> read-advance information, and understand how processor resources are 
> used by
> instructions.  By design, the quality of the performance analysis 
> conducted by
> the tool is inevitably affected by the quality of the target 
> scheduling models.
>
> However, scheduling models intentionally do not describe all 
> processors details,
> since the goal is just to enable the scheduling of machine 
> instructions during
> compilation. That means, there are processor details which are not 
> important for
> the purpose of scheduling instructions (and therefore not described by the
> scheduling model), but are very important for this tool.
>
> A few examples of details that are missing in scheduling models are:
>  - Maximum number of instructions retired per cycle.
>  - Actual dispatch width (it often differs from the issue width).
>  - Number of temporary registers available for renaming.
>  - Number of read/write ports in the register file(s).
>  - Length of the load/store queue in the LSUnit.
>
> It is also very difficult to find a "good" abstract model to
describe the
> behavior of out-of-order processors. So, we have to keep in mind that 
> all of
> these aspects are going to affect the quality of the static analysis 
> performed
> by the tool.
>
> An extensive list of known limitations is reported in one of the last 
> sections
> of this document. There is also a section related to design problems 
> which must
> be addressed (hopefully with the help of the community).  At the 
> moment, the
> tool has been mostly tested for x86 targets, but there are still several
> limitations, some of which could be overcome by integrating extra 
> information
> into the scheduling models.
>
> As was mentioned before, this tool has been (and is still being) used 
> internally
> in Sony to debug/triage issues in the btver2 scheduling model. We have 
> also
> tested it on other targets to check how generic the tool is. In our 
> experience,
> the tool makes it easy to identify simple mistakes like "wrong number 
> of micro
> opcodes specified for an instruction", or "wrong set of hardware 
> resources".
> Some of these mistakes are quite common (sometimes on mature models 
> too), and
> often difficult to catch.  Reports generated by this tool are simple 
> to analyze,
> and contain enough details to help triage most performance problems.
>
> 1. How the tool works
> ---------------------
>
> The tool takes assembly code as input. Assembly code is parsed into a 
> sequence
> of MCInst with the help of the existing LLVM target assembly parsers. 
> The parsed
> sequence of MCInst is then analyzed by a 'Backend' module to
generate a
> performance report.
>
> The Backend module internally emulates the execution of the machine code
> sequence in a loop of iterations (which by default is 70). At the end 
> of this
> process, the backend collects a number of statistics which are then 
> printed out
> in the form of a report.
>
> Here is an example of performance report generated by the tool for a 
> dot-product
> of two packed float vectors of four elements. The analysis is 
> conducted for
> target x86, cpu btver2:
>
> ///////////////////
>
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Resources:
> [0] - JALU0
> [1] - JALU1
> [2] - JDiv
> [3] - JFPM
> [4] - JFPU0
> [5] - JFPU1
> [6] - JLAGU
> [7] - JSAGU
> [8] - JSTC
> [9] - JVIMUL
>
>
> Resource pressure per iteration:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8] [9]
>  -      -      -      -     2.00   1.00    -      -      - -
>
> Resource pressure by instruction:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8] [9]        
> Instructions:
>  -      -      -      -      -     1.00    -      -      - -         
> vmulps    %xmm0, %xmm1, %xmm2
>  -      -      -      -     1.00    -      -      -      - -         
> vhaddps    %xmm2, %xmm2, %xmm3
>  -      -      -      -     1.00    -      -      -      - -         
> vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Instruction Info:
> [1]: #uOps
> [2]: Latency
> [3]: RThroughput
> [4]: MayLoad
> [5]: MayStore
> [6]: HasSideEffects
>
> [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
>  1      2     1.00                        vmulps    %xmm0, %xmm1, %xmm2
>  1      3     1.00                        vhaddps    %xmm2, %xmm2, %xmm3
>  1      3     1.00                        vhaddps    %xmm3, %xmm3, %xmm4
>
> ///////////////////
>
> According to this report, the dot-product kernel has been executed 300 
> times,
> for a total of 900 instructions dynamically executed.
>
> The report is structured in three main sections.  A first section 
> collects a few
> performance numbers; the goal of this section is to give a very quick 
> overview
> of the performance throughput. In this example, the two important 
> perforamce
> indicators are a) the predicted total number of cycles, and b) the IPC.
> IPC is probably the most important throughput indicator. A big delta 
> between the
> Dispatch Width and the computed IPC is an indicator of potential 
> performance
> issues.
>
> The second section is the so-called "resource pressure view".
This
> view reports
> the average number of resource cycles consumed every iteration by 
> instructions
> for every processor resource unit available on the target. Information is
> structured in two tables. The first table reports the number of 
> resource cycles
> spent on average every iteration. The second table correlates the resource
> cycles to the machine instruction in the sequence. For example, every 
> iteration
> of the dot-product, instruction 'vmulps' always executes on
resource
> unit [5]
> (JFPU1 - floating point pipeline #1), consuming an average of 1 
> resource cycle
> per iteration.  Note that on Jaguar, vector FP multiply can only be 
> issued to
> pipeline JFPU1, while horizontal FP adds can only be issued to 
> pipeline JFPU0.
>
> The third (and last) section of the report shows the latency and 
> reciprocal
> throughput of every instruction in the sequence. That section also 
> reports extra
> information related to the number of micro opcodes, and opcode 
> properties (i.e.
> 'MayLoad', 'MayStore' and 'UnmodeledSideEffects').
>
> The resource pressure view helps with identifying bottlenecks caused 
> by high
> usage of specific hardware resources.  Situations with resource 
> pressure mainly
> concentrated on a few resources should, in general, be avoided. Ideally,
> pressure should be uniformly distributed between multiple resources.
>
> Timeline View
> -------------
>
> A detailed report of each instruction's state transitions over time can
be
> enabled using the command line flag '-timeline'.  This prints an
extra
> section
> in the report which contains the so-called "timeline view". Below
is the
> timeline view for the dot-product example from the previous section.
>
> ///////////////
> Timeline view:
>                    012345
> Index    0123456789
>
> [0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2
> [0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3
> [0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2
> [1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3
> [1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2
> [2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3
> [2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Average Wait times (based on the timeline view):
> [0]: Executions
> [1]: Average time spent waiting in a scheduler's queue
> [2]: Average time spent waiting in a scheduler's queue while ready
> [3]: Average time elapsed from WB until retire stage
>
>       [0]    [1]    [2]    [3]
> 0.     3     1.0    1.0    3.3      vmulps    %xmm0, %xmm1, %xmm2
> 1.     3     3.3    0.7    1.0      vhaddps    %xmm2, %xmm2, %xmm3
> 2.     3     5.7    0.0    0.0      vhaddps    %xmm3, %xmm3, %xmm4
> ///////////////
>
> The timeline view is very interesting because it shows how 
> instructions changed
> in state during execution.  It also gives an idea of how the tool
"sees"
> instructions executed on the target.
>
> The timeline view is structured in two tables.  The first table shows how
> instructions change in state over time (measured in cycles); the 
> second table
> (named "Average Wait times") reports useful timing statistics
which
> should help
> diagnose performance bottlenecks caused by long data dependencies and 
> sub-optimal
> usage of hardware resources.
>
> An instruction in the timeline view is identified by a pair of 
> indices, where
> the 'first' index identifies an iteration, and the 'second'
index is
> the actual
> instruction index (i.e. where it appears in the code sequence).
>
> Excluding the first and last column, the remaining columns are in 
> cycles.  Cycles
> are numbered sequentially starting from 0.  The following characters 
> are used to
> describe the state of an instruction:
>
>  D : Instruction dispatched.
>  e : Instruction executing.
>  E : Instruction executed.
>  R : Instruction retired.
>  = : Instruction already dispatched, waiting to be executed.
>  - : Instruction executed, waiting to be retired.
>
> Based on the timeline view from the example, we know that:
>   - Instruction [1, 0] was dispatched at cycle 1.
>   - Instruction [1, 0] started executing at cycle 2.
>   - Instruction [1, 0] reached the write back stage at cycle 4.
>   - Instruction [1, 0] was retired at cycle 10.
>
> Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't have to 
> wait in
> the scheduler's queue for the operands to become available. By the 
> time the
> vmulps is dispatched, operands are already available, and pipeline 
> JFPU1 is
> ready to serve another instruction.  So the instruction can be immediately
> issued on the JFPU1 pipeline. That is demonstrated by the fact that the
> instruction only spent 1cy in the scheduler's queue.
>
> There is a gap of 5 cycles between the write-back stage and the retire 
> event.
> That is because instructions must retire in program order, so [1,0] 
> has to wait
> for [0, 2] to be retired first (i.e it has to wait unti cycle 10).
>
> In the dot-product example, all instructions are in a RAW (Read After 
> Write)
> dependency chain.  Register %xmm2 written by the vmulps is immediately 
> used by
> the first vhaddps, and register %xmm3 written by the first vhaddps is 
> used by
> the second vhaddps.  Long data dependencies negatively affect the ILP
> (Instruction Level Parallelism).
>
> In the dot-product example, there are anti-dependencies introduced by
> instructions from different iterations.  However, those dependencies 
> can be
> removed at register renaming stage (at the cost of allocating register 
> aliases,
> and therefore consuming temporary registers).
>
> Table "Average Wait times" helps diagnose performance issues that
are
> caused by
> the presence of long latency instructions and potentially long data 
> dependencies
> which may limit the ILP.  Note that the tool by default assumes at 
> least 1cy
> between the dispatch event and the issue event.
>
> When the performance is limited by data dependencies and/or long latency
> instructions, the number of cycles spent while in the "ready"
state is
> expected
> to be very small when compared with the total number of cycles spent 
> in the
> scheduler's queue.  So the difference between the two counters is a
good
> indicator of how big of an impact data dependencies had on the 
> execution of
> instructions.  When performance is mostly limited by the lack of hardware
> resources, the delta between the two counters is small. However, the 
> number of
> cycles spent in the queue tends to be bigger (i.e. more than 1-3cy) 
> especially
> when compared with other low latency instructions.
>
> Extra statistics to further diagnose performance issues.
> --------------------------------------------------------
>
> Flag '-verbose' enables extra statistics and performance counters
for the
> dispatch logic, the reorder buffer, the retire control unit and the 
> register
> file.
>
> Below is an example of verbose output generated by the tool for the 
> dot-product
> example discussed in the previous sections.
>
> ///////////////////
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Dynamic Dispatch Stall Cycles:
> RAT     - Register unavailable:                      0
> RCU     - Retire tokens unavailable:                 0
> SCHEDQ  - Scheduler full:                            272
> LQ      - Load queue full:                           0
> SQ      - Store queue full:                          0
> GROUP   - Static restrictions on the dispatch group: 0
>
>
> Register Alias Table:
> Total number of mappings created: 900
> Max number of mappings used:      35
>
>
> Dispatch Logic - number of cycles where we saw N instructions dispatched:
> [# dispatched], [# cycles]
>  0,              24  (3.9%)
>  1,              272  (44.6%)
>  2,              314  (51.5%)
>
>
> Schedulers - number of cycles where we saw N instructions issued:
> [# issued], [# cycles]
>  0,          7  (1.1%)
>  1,          306  (50.2%)
>  2,          297  (48.7%)
>
>
> Retire Control Unit - number of cycles where we saw N instructions 
> retired:
> [# retired], [# cycles]
>  0,           109  (17.9%)
>  1,           102  (16.7%)
>  2,           399  (65.4%)
>
>
> Scheduler's queue usage:
> JALU01,  0/20
> JFPU01,  18/18
> JLSAGU,  0/12
> ///////////////////
>
> Based on the verbose report, the backend was only able to dispatch two
> instructions 51.5% of the time.  The dispatch group was limited to one
> instruction 44.6% of the cycles, which corresponds to 272 cycles.
>
> If we look at section "Dynamic Dispatch Stall Cycles", we can see
how
> counter
> SCHEDQ reports 272 cycles.  Counter SCHEDQ is incremented every time 
> the dispatch
> logic is unable to dispatch a full group of two instructions because the
> scheduler's queue is full.
>
> Section "Scheduler's queue usage" shows how the maximum
number of
> buffer entries
> (i.e. scheduler's queue entries) used at runtime for resource JFPU01 
> reached its
> maximum. Note that AMD Jaguar implements three schedulers:
>   * JALU01 - A scheduler for ALU instructions
>   * JLSAGU - A scheduler for address generation
>   * JFPU01 - A scheduler floating point operations.
>
> The dot-product is a kernel of three floating point instructions (a vector
> multiply followed by two horizontal adds).  That explains why only the 
> floating
> point scheduler appears to be used according to section
"Scheduler's queue
> usage".
>
> A full scheduler's queue is either caused by data dependency chains, 
> or by a
> sub-optimal usage of hardware resources.  Sometimes, resource pressure 
> can be
> mitigated by rewriting the kernel using different instructions that 
> consume
> different scheduler resources.  Schedulers with a small queue are less 
> resilient
> to bottlenecks caused by the presence of long data dependencies.
>
> In this example, we can conclude that the IPC is mostly limited by data
> dependencies, and not by resource pressure.
>
> LLVM-MCA instruction flow
> -------------------------
>
> This section describes the instruction flow through the out-of-order 
> backend, as
> well as the functional units involved in the process.
>
> An instruction goes through a default sequence of stages:
>     - Dispatch (Instruction is dispatched to the schedulers).
>     - Issue (Instruction is issued to the processor pipelines).
>     - Write Back (Instruction is executed, and results are written back).
>     - Retire (Instruction is retired; writes are architecturally 
> committed).
>
> The tool only models the out-of-order portion of a processor. 
> Therefore, the
> instruction fetch and decode stages are not modeled. Performance 
> bottlenecks in
> the frontend are not diagnosed by this tool.  The tool assumes that 
> instructions
> have all been decoded and placed in a queue.  Also, the tool doesn't
know
> anything about branch prediction.
>
> Instruction Dispatch
> --------------------
>
> During the Dispatch stage, instructions are picked in program order 
> from a queue
> of already decoded instructions, and dispatched in groups to the hardware
> schedulers.  The dispatch logic is implemented by class DispatchUnit 
> in file
> Dispatch.h.
>
> The size of a dispatch group depends on the availability of hardware 
> resources,
> and it cannot exceed the value of field 'DispatchWidth' in class 
> DispatchUnit.
> Note that field DispatchWidth defaults to the value of field 
> 'IssueWidth' from
> the scheduling model.
>
> Users can override the DispatchWidth value with flag
"-dispatch=<N>"
> (where 'N'
> is an unsigned quantity).
>
> An instruction can be dispatched if:
>  - The size of the dispatch group is smaller than DispatchWidth
>  - There are enough entries in the reorder buffer
>  - There are enough temporary registers to do register renaming
>  - Schedulers are not full.
>
> Scheduling models don't describe register files, and therefore the 
> tool doesn't
> know if there is more than one register file, and how many temporaries are
> available for register renaming.
>
> By default, the tool (optimistically) assumes a single register file 
> with an
> unbounded number of temporary registers.  Users can limit the number of
> temporary registers available for register renaming using flag
> `-register-file-size=<N>`, where N is the number of temporaries.  A 
> value of
> zero for N means 'unbounded'.  Knowing how many temporaries are 
> available for
> register renaming, the tool can predict dispatch stalls caused by the 
> lack of
> temporaries.
>
> The number of reorder buffer entries consumed by an instruction 
> depends on the
> number of micro-opcodes it specifies in the target scheduling model 
> (see field
> 'NumMicroOpcodes' of tablegen class ProcWriteResources and its
derived
> classes;
> TargetSchedule.td).
>
> The reorder buffer is implemented by class RetireControlUnit (see 
> Dispatch.h).
> Its goal is to track the progress of instructions that are 
> "in-flight", and
> retire instructions in program order.  The number of entries in the 
> reorder
> buffer defaults to the value of field 'MicroOpBufferSize' from the
target
> scheduling model.
>
> Instructions that are dispatched to the schedulers consume scheduler 
> buffer
> entries.  The tool queries the scheduling model to figure out the set of
> buffered resources consumed by an instruction.  Buffered resources are 
> treated
> like "scheduler" resources, and the field 'BufferSize'
(from the processor
> resource tablegen definition) defines the size of the scheduler's
queue.
>
> Zero latency instructions (for example NOP instructions) don't consume 
> scheduler
> resources.  However, those instructions still reserve a number of 
> slots in the
> reorder buffer.
>
> Instruction Issue
> -----------------
>
> As mentioned in the previous section, each scheduler resource 
> implements a queue
> of instructions.  An instruction has to wait in the scheduler's queue 
> until
> input register operands become available.  Only at that point, does the
> instruction becomes eligible for execution and may be issued (potentially
> out-of-order) to a pipeline for execution.
>
> Instruction latencies can be computed by the tool with the help of the
> scheduling model; latency values are defined by the scheduling model 
> through
> ProcWriteResources objects.
>
> Class Scheduler (see file Scheduler.h) knows how to emulate multiple 
> processor
> schedulers.  A Scheduler is responsible for tracking data 
> dependencies, and
> dynamically select which processor resources are consumed/used by 
> instructions.
>
> Internally, the Scheduler class delegates the management of processor 
> resource
> units and resource groups to the ResourceManager class. 
> ResourceManager is also
> responsible for selecting resource units that are effectively consumed by
> instructions.  For example, if an instruction consumes 1cy of a 
> resource group,
> the ResourceManager object selects one of the available units from the 
> group; by
> default, it uses a round-robin selector to guarantee that resource 
> usage is
> uniformly distributed between all units of a group.
>
> Internally, class Scheduler implements three instruction queues:
>   - WaitQueue: a queue of instructions whose operands are not ready yet.
>   - ReadyQueue: a queue of instructions ready to execute.
>   - IssuedQueue: a queue of instructions executing.
>
> Depending on the operands availability, instructions that are 
> dispatched to the
> Scheduler are either placed into the WaitQueue or into the ReadyQueue.
>
> Every cycle, class Scheduler checks if instructions can be moved from the
> WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue 
> can be
> issued to the underlying pipelines.  The algorithm prioritizes older
> instructions over younger instructions.
>
> Objects of class ResourceState (see Scheduler.h) describe processor 
> resources.
> There is an instance of class ResourceState for each single processor 
> resource
> specified by the scheduling model.  A ResourceState object for a processor
> resource with multiple units dynamically tracks the availability of 
> every single
> unit.  For example, the ResourceState of a resource group tracks the
> availability of every resource in that group.  Internally, ResourceState
> implements a round-robin selector to dynamically pick the next unit to 
> use from
> the group.
>
> Write-Back and Retire Stage
> ---------------------------
>
> Issued instructions are moved from the ReadyQueue to the IssuedQueue.  
> There,
> instructions wait until they reach the write-back stage.  At that 
> point, they
> get removed from the queue and the retire control unit is notified.
>
> On the event of "instruction executed", the retire control unit
flags the
> instruction as "ready to retire".
>
> Instruction are retired in program order; an "instruction
retired"
> event is sent
> to the register file which frees the temporary registers allocated for the
> instruction at register renaming stage.
>
> Load/Store Unit and Memory Consistency Model
> --------------------------------------------
>
> The tool attempts to emulate out-of-order execution of memory 
> operations.  Class
> LSUnit (see file LSUnit.h) emulates a load/store unit implementing 
> queues for
> speculative execution of loads and stores.
>
> Each load (or store) consumes an entry in the load (or store) queue.  
> The number
> of slots in the load/store queues is unknown by the tool, since there 
> is no
> mention of it in the scheduling model.  In practice, users can specify 
> flag
> `-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in the 
> queue to be
> equal to exactly N (an unsigned value). If N is zero, then the tool 
> assumes an
> unbounded queue (this is the default).
>
> LSUnit implements a relaxed consistency model for memory loads and 
> stores. The
> rules are:
> 1) A younger load is allowed to pass an older load only if there is no
>    intervening store in between the two loads.
> 2) An younger store is not allowed to pass an older store.
> 3) A younger store is not allowed to pass an older load.
> 4) A younger load is allowed to pass an older store provided that the 
> load does
>    not alias with the store.
>
> By default, this class conservatively (i.e. pessimistically) assumes 
> that loads
> always may-alias store operations.  Essentially, this LSUnit doesn't 
> perform any
> sort of alias analysis to rule out cases where loads and stores don't 
> overlap
> with each other.  The downside of this approach however is that 
> younger loads are
> never allowed to pass older stores.  To make it possible for a younger 
> load to
> pass an older store, users can use the command line flag -noalias.  Under
> 'noalias', a younger load is always allowed to pass an older store.
>
> Note that, in the case of write-combining memory, rule 2. could be 
> relaxed a bit
> to allow reordering of non-aliasing store operations.  That being 
> said, at the
> moment, there is no way to further relax the memory model (flag 
> -noalias is the
> only option).  Essentially, there is no option to specify a different 
> memory
> type (for example: write-back, write-combining, write-through; etc.) and
> consequently to weaken or strengthen the memory model.
>
> Other limitations are:
>  * LSUnit doesn't know when store-to-load forwarding may occur.
>  * LSUnit doesn't know anything about the cache hierarchy and memory 
> types.
>  * LSUnit doesn't know how to identify serializing operations and 
> memory fences.
>
> No assumption is made on the store buffer size.  As mentioned before, 
> LSUnit
> conservatively assumes a may-alias relation between loads and stores, 
> and it
> doesn't attempt to identify cases where store-to-load forwarding would 
> occur in
> practice.
>
> LSUnit doesn't attempt to predict whether a load or store hits or 
> misses the L1
> cache.  It only knows if an instruction "MayLoad" and/or
"MayStore". 
> For loads,
> the scheduling model provides an "optimistic" load-to-use latency
> (which usually
> matches the load-to-use latency for when there is a hit in the L1D).
>
> Class MCInstrDesc in LLVM doesn't know about serializing operations,
nor
> memory-barrier like instructions.  LSUnit conservatively assumes that an
> instruction which has both 'MayLoad' and
'UnmodeledSideEffects'
> behaves like a
> "soft" load-barrier.  That means, it serializes loads without
forcing
> a flush of
> the load queue.  Similarly, instructions flagged with both
'MayStore' and
> 'UnmodeledSideEffects' are treated like store barriers.  A full
memory
> barrier
> is a 'MayLoad' and 'MayStore' instruction with 
> 'UnmodeledSideEffects'.  This is
> inaccurate, but it is the best that we can do at the moment with the 
> current
> information available in LLVM.
>
> A load/store barrier consumes one entry of the load/store queue.  A 
> load/store
> barrier enforces ordering of loads/stores.  A younger load cannot pass 
> a load
> barrier.  Also, a younger store cannot pass a store barrier.  A 
> younger load has
> to wait for the memory/load barrier to execute.  A load/store barrier is
> "executed" when it becomes the oldest entry in the load/store 
> queue(s). That
> also means, by construction, all the older loads/stores have been 
> executed.
>
> In conclusion the full set of rules is:
>   1. A store may not pass a previous store.
>   2. A load may not pass a previous store unless flag 'NoAlias' is
set.
>   3. A load may pass a previous load.
>   4. A store may not pass a previous load (regardless of flag
'NoAlias').
>   5. A load has to wait until an older load barrier is fully executed.
>   6. A store has to wait until an older store barrier is fully executed.
>
> Known limitations
> -----------------
> Previous sections described cases where the tool is missing 
> information to give
> an accurate report.  For example, the first sections of this document 
> explained
> how the lack of knowledge about the processor negatively affects the 
> performance
> analysis.  The lack of knowledge is often a consequence of how 
> scheduling models
> are defined; as mentioned before, scheduling models intentionally 
> don't describe
> processors in fine details.
>
> The accuracy of the performance analysis is also affected by 
> assumptions made by
> the processor model used by the tool.
>
> Most recent Intel and AMD processors implement dedicated 
> LoopBuffer/OpCache in
> the hardware frontend to speedup the throughput in the presence of 
> tight loops.
> The presence of these buffers complicates the decoding logic, and requires
> knowledge on the branch predictor too.  Class 'SchedMachineModel'
in
> tablegen
> provides a field named 'LoopMicroOpBufferSize' which is used to 
> describe loop
> buffers.  However, the purpose of that field is to enable loop 
> unrolling of
> tight loops; essentially, it affects the cost model used by pass 
> loop-unroll.
>
> By design, the tool only cares about the out-of-order portion of a 
> processor,
> and consequently doesn't try to predict the frontend throughput.  
> Processors may
> implement complex decoding schemes; statically predicting the frontend
> throughput is in general beyond the scope of this tool.  For the same 
> reasons,
> this tool intentionally doesn't model branch prediction.  That being 
> said, this
> tool could be definitely extended in future to also account for the 
> hardware
> frontend when doing performance analysis.  This would inevitably 
> require extra
> (extensive) processor knowledge related to all the available decoding 
> paths in
> the hardware frontend.
>
> When computing the IPC, the tool assumes a zero-latency "perfect"
> fetch&decode
> stage; the full sequence of decoded instructions is immediately 
> visible to the
> dispatch logic from the start.
>
> The tool doesn't know about simultaneous mutithreading. According to 
> the tool,
> processor resources are not statically/dynamically partitioned. Processor
> resources are fully available to the hardware thread executing the
> microbenchmark.
>
> The execution model implemented by this tool assumes that instructions are
> firstly dispatched in groups to hardware schedulers, and then issued to
> pipelines for execution.  The model assumes dynamic scheduling of 
> instructions.
> Instructions are placed in a queue and potentially executed 
> out-of-order (based
> on the operand availability). The dispatch stage is definitely 
> distinct from the
> issue stage.
>
> This model doesn't correctly describe processors where the 
> dispatch/issue is a
> single stage. This is what happens for example in VLIW processors, where
> instructions are packaged and statically scheduled at compile time; it 
> is up to
> the compiler to predict the latency of instructions and package issue 
> groups
> accordingly. For such targets, there is no dynamic scheduling done by the
> hardware.
>
> Existing classes (DispatchUnit, Scheduler, etc.) could be 
> extended/adapted to
> support processors with a single dispatch/issue stage. The execution 
> flow would
> require some changes in the way how existing components (i.e. 
> DispatchUnit,
> Scheduler, etc.) interact. This can be a future development.
>
> The following sections describes other known limitations.  The goal is 
> not to
> provide an extensive list of limitations; we want to report what we 
> believe are
> the most important limitations, and suggest possible methods to 
> overcome them.
>
> Load/Store barrier instructions and serializing operations
> ----------------------------------------------------------
> Section "Load/Store Unit and Memory Consistency Model" already 
> mentioned how
> LLVM doesn't know about serializing operations and memory barriers.  
> Most of it
> boils down to the fact that class MCInstrDesc (intentionally) doesn't 
> expose
> those properties.  Instead, both serializing operations and memory 
> barriers
> "have side-effects" according to MCInstrDesc.  That is because,
at
> least for
> scheduling purposes, knowing that an instruction has unmodeled side 
> effects is
> often enough to treat the instruction like a compiler scheduling barrier.
>
> A performance analysis tool could use the extra knowledge on barriers and
> serializing operations to generate a more accurate performance 
> report.  One way
> to improve this is by reserving a couple of bits in field 'Flags'
from
> class
> MCInstrDesc: one bit for barrier operations, and another bit to mark
> instructions as serializing operations.
>
> Lack of support for instruction itineraries
> -------------------------------------------
> The current version of the tool doesn't know how to process instruction
> itineraries.  This is probably one of the most important limitations, 
> since it
> affects a few out-of-order processors in LLVM.
>
> As mentioned in section 'Instruction Issue', class Scheduler
delegates
> to an
> instance of class ResourceManager the handling of processor resources.
> ResourceManager is where most of the scheduling logic is implemented.
>
> Adding support for instruction itineraries requires that we teach
> ResourceManager how to handle functional units and instruction 
> stages.  This
> development can be a future extension, and it would probably require a few
> changes to the ResourceManager interface.
>
> Instructions that affect control flow are not correctly modeled
> ---------------------------------------------------------------
> Examples of instructions that affect the control flow are: return, 
> indirect
> branches, calls, etc.  The tool doesn't try to predict/evaluate branch 
> targets.
> In particular, the tool doesn't model any sort of branch prediction, 
> nor does it
> attempt to track changes to the program counter.  The tool always 
> assumes that
> the input assembly sequence is the body of a microbenchmark (a simple loop
> executed for a number of iterations). The "next" instruction in 
> sequence is
> always the next instruction to dispatch.
>
> Call instructions default to an arbitrary high latency of 100cy. A 
> warning is
> generated if the tool encounters a call instruction in the sequence.  
> Return
> instructions are not evaluated, and therefore control flow is not 
> affected.
> However, the tool still queries the processor scheduling model to 
> obtain latency
> information for instructions that affect the control flow.
>
> Possible extensions to the scheduling model
> -------------------------------------------
> Section "Instruction Dispatch" explained how the tool doesn't
know
> about the
> register files, and temporaries available in each register file for 
> register
> renaming purposes.
>
> The LLVM scheduling model could be extended to better describe 
> register files.
> Ideally, scheduling model should be able to define:
>  - The size of each register file
>  - How many temporary registers are available for register renaming
>  - How register classes map to register files
>
> The scheduling model doesn't specify the retire throughput (i.e. how
many
> instructions can be retired every cycle).  Users can specify flag
> `-max-retire-per-cycle=<uint>` to limit how many instructions the 
> retire control
> unit can retire every cycle.  Ideally, every processor should be able 
> to specify
> the retire throughput (for example, by adding an extra field to the 
> scheduling
> model tablegen class).
>
> Known limitations on X86 processors
> -----------------------------------
>
> 1) Partial register updates versus full register updates.
>
> On x86-64, a 32-bit GPR write fully updates the super-register. Example:
>       add %edi %eax    ## eax += edi
>
> Here, register %eax aliases the lower half of 64-bit register %rax. On 
> x86-64,
> register %rax is fully updated by the 'add' (the upper half of %rax
is
> zeroed).
> Essentially, it "kills" any previous definition of (the upper
half of)
> register
> %rax.
>
> On the other hand, 8/16 bit register writes only perform a so-called 
> "partial
> register update". Example:
>       add %di, %ax     ## ax += di
>
> Here, register %eax is only partially updated. To be more specific, 
> the lower
> half of %eax is set, and the upper half is left unchanged. There is 
> also no
> change in the upper 48 bits of register %rax.
>
> To get accurate performance analysis, the tool has to know which 
> instructions
> perform a partial register update, and which instructions fully update the
> destination's super-register.
>
> One way to expose this information is (again) via tablegen.  For 
> example, we
> could add a flag in the tablegen instruction class to tag instructions 
> that
> perform partial register updates. Something like this: 'bit
> hasPartialRegisterUpdate = 1'. However, this would force a `let
> hasPartialRegisterUpdate = 0` on several instruction definitions.
>
> Another approach is to have a MCSubtargetInfo hook similar to this:
>     virtual bool updatesSuperRegisters(unsigned short opcode) { return 
> false; }
>
> Targets will be able to override this method if needed.  Again, this 
> is just an
> idea. But the plan is to have this fixed as a future development.
>
> 2) Macro Op fusion.
>
> The tool doesn't know about macro-op fusion. On modern x86 processors,
a
> 'cmp/test' followed by a 'jmp' is fused into a single macro
> operation.  The
> advantage is that the fused pair only consumes a single slot in the 
> dispatch
> group.
>
> As a future development, the tool should be extended to address 
> macro-fusion.
> Ideally, we could have LLVM generate a table enumerating all the 
> opcode pairs
> that can be fused together. That table could be exposed to the tool 
> via the
> MCSubtargetInfo interface.  This is just an idea; there may be better 
> ways to
> implement this.
>
> 3) Intel processors: mixing legacy SSE with AVX instructions.
>
> On modern Intel processors with AVX, mixing legacy SSE code with AVX code
> negatively impacts the performance.  The tool is not aware of this 
> issue, and
> the performance penalty is not accounted when doing the analysis. This is
> something that we would like to improve in future.
>
> 4) Zero-latency register moves and Zero-idioms.
>
> Most modern AMD/Intel processors know how to optimize out 
> register-register
> moves and zero idioms at register renaming stage. The tool doesn't know
> about these patterns, and this may negatively impact the performance 
> analysis.
>
> Known design problems
> ---------------------
> This section describes two design issues that are currently affecting 
> the tool.
> The long term plan is to "fix" these issues.
>
> 1) Variant instructions not correctly modeled.
>
> The tool doesn't know how to analyze instructions with a
"variant"
> scheduling
> class descriptor. A variant scheduling class needs to be resolved 
> dynamically.
> The "actual" scheduling class often depends on the subtarget, as
well as
> properties of the specific MachineInstr object.
>
> Unfortunately, the tool manipulates MCInst, and it doesn't know 
> anything about
> MachineInstr. As a consequence, the tool cannot use the existing machine
> subtarget hooks that are normally used to resolve the variant 
> scheduling class.
> This is a major design issue which mostly affects ARM/AArch64 targets.  It
> mostly boils down to the fact that the existing scheduling framework 
> was meant
> to work for MachineInstr.
>
> When the tool encounters a "variant" instruction, it assumes a
generic 1cy
> latency. However, the tool would not be able to tell which processor 
> resources
> are effectively consumed by the variant instruction.
>
> 2) MCInst and MCInstrDesc.
>
> Performance analysis tools require data dependency information to 
> correctly
> predict the runtime performance of the code. This tool must always be 
> able to
> obtain the set of implicit/explicit register defs/uses for every 
> instruction of
> the input assembly sequence.
>
> In the first section of this document, it was mentioned how the tool 
> takes as
> input an assembly sequence. That sequence is parsed into a MCInst 
> sequence with
> the help of assembly parsers available from the targets.
>
> A MCInst is a very low-level instruction representation. The tool can 
> inspect
> the MCOperand sequence of an MCInst to identify register operands. 
> However,
> there is no way to tell register operands that are definitions from 
> register
> operands that are uses.
>
> In LLVM, class MCInstrDesc is used to fully describe target 
> instructions and
> their operands. The opcode of a machine instruction (a MachineInstr 
> object) can
> be used to query the instruction set through method `MCInstrInfo::get' 
> to obtain
> the associated MCInstrDesc object.
>
> However class MCInstrDesc describes properties and operands of 
> MachineInstr
> objects. Essentially, MCInstrDesc is not meant to be used to describe 
> MCInst
> objects.  To be more specific, MCInstrDesc objects are automatically 
> generated
> via tablegen from the instruction set description in the target .td 
> files.  For
> example, field `MCInstrDesc::NumDefs' is always equal to the 
> cardinality of the
> `(outs)` set from the tablegen instruction definition.
>
> By construction, register definitions always appear at the beginning 
> of the
> MachineOperands list in MachineInstr. Basically, the (outs) are the first
> operands of a MachineInstr, and the (ins) will come after in the 
> machine operand
> list. Knowing the number of register definitions is enough to identify
> all the register operands that are definitions.
>
> In a normal compilation process, MCInst objects are generated from 
> MachineInstr
> objects through a lowering step. By default the lowering logic simply 
> iterates
> over the machine operands of a MachineInstr, and converts/expands them 
> into
> equivalent MCOperand objects.
>
> The default lowering strategy has the advantage of preserving all of the
> above mentioned assumptions on the machine operand sequence. That 
> means, register
> definitions would still be at the beginning of the MCOperand sequence, and
> register uses would come after.
>
> Targets may still define custom lowering routines for specific 
> opcodes. Some of
> these routines may lower operands in a way that potentially breaks 
> (some of) the
> assumptions on the machine operand sequence which were valid for 
> MachineInstr.
> Luckily, this is not the most common form of lowering done by the 
> targets, and
> the vast majority of the MachineInstr are lowered based on the default 
> strategy
> which preserves the original machine operand sequence.  This is 
> especially true
> for x86, where the custom lowering logic always preserves the original 
> (i.e.
> from the MachineInstr) operand sequence.
>
> This tool currently works under the strong (and potentially incorrect)
> assumption that register def/uses in a MCInst can always be identified by
> querying the machine instruction descriptor for the opcode. This 
> assumption made
> it possible to develop this tool and get good numbers at least for the
> processors available in the x86 backend.
>
> That being said, the analysis is still potentially incorrect for other 
> targets.
> So we plan (with the help of the community) to find a proper mechanism 
> to map
> when possible MCOperand indices back to MachineOperand indices of the 
> equivalent
> MachineInstr.  This would be equivalent to describing changes made by the
> lowering step which affected the operand sequence. For example, we 
> could have an
> index for every register MCOperand (or -1, if the operand didn't exist 
> in the
> original MachineInstr). The mapping could look like this <0,1,3,2>. 
Here,
> MCOperand #2 was obtained from the lowering of MachineOperand #3. etc.
>
> This information could be automatically generated via tablegen for all the
> instructions whose custom lowering step breaks assumptions made by the 
> tool on
> the register operand sequence (In general, these instructions should 
> be the
> minority of a target's instruction set). Unfortunately, we don't
have that
> information now.  As a consequence, we assume that the number of explicit
> register definitions is the same number specified in MCInstrDesc.  We also
> assume that register definitions always come first in the operand 
> sequence.
>
> In conclusion: these are for now the strong assumptions made by the tool:
>   * The number of explicit and implicit register definitions in a MCInst
>     matches the number of explicit and implicit definitions specified 
> by the
>     MCInstrDesc object.
>   * Register uses always come after register definitions.
>   * If an opcode specifies an optional definition, then the optional
>     definition is always the last register operand in the sequence.
>
> Note that some of the information accessible from the MCInstrDesc is 
> always
> valid for MCInst. For example: implicit register defs, implicit 
> register uses
> and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties
still
> apply to
> MCInst. The tool knows about this, and uses that information during its
> analysis.
>
> What to do next
> ---------------
> The source code has been uploaded for review on phabricator at this 
> link: https://reviews.llvm.org/D43951.
>
> The review covers two patches:
> A first (very small) patch that always enables the generation of processor
> resource names in the SubtargetEmitter. Currently, the processor 
> resource names
> are only generated for debugging purposes, but are needed by the tool to
> generate user friendly reports, so we would like to always generate them.
> A second patch with the actual static analysis tool (in llvm/tools).
>
> Once these first two patches are committed, the plan is to keep 
> working on the
> tool with the help of the community to address all of the limitations 
> described
> by the previous sections, and find good solutions/fixes for the design 
> issues
> described by section "Known design problems".
>
> We hope the community will find this tool useful like we have.
>
> Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell 
> who really
> helped me a lot by suggesting improvements and testing the tool.
>
> Thanks for your time.
> -Andrea
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180302/8e72fade/attachment-0001.html>

Philip Reames via llvm-dev

2018-Mar-02 18:06 UTC

head link

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

Reading through this last night got be thinking about how to model 
control flow.  Given most of my source code tends to be very branchy, be 
limited to straight line code is quite restrictive.  The main thing is 
that it requires a lot of hand simplification which can be rather error 
prone at times.

It occurs to me that we could remove the restriction around branches 
without necessarily fully modeling fetch, decode, or prediction.  If in 
addition to an assembly file, we also feed the tool a trace of branch 
executions, the tool could essentially unroll all loops dynamically.  
The type of thing I'm thinking of would look like something like this: 
(ENTRY, branch_dest, branch_dest, branch_dest) where each
"branch_dest"
is the taken successor of the next encountered branch instruction.

As an example, consider a simple loop which executes 3 times:
ENTRY:
   # fallthrough
bb1:
   # various instructions
   jne bb1
bb2:
   ret

The trace for this would look like: (ENTRY, bb1, bb1, bb1, FALLTHROUGH).

The nice thing about such a branch trace is that it separate the source 
of the trace from the analysis tool.  You could provide an actual trace 
collected through instrumentation, or a predicted trace generated from 
sample profiling information.

(There are obvious extensions to handle indirect branches, calls, and 
such here.  I'm leaving that as an exercise for the reader at the moment.)

By itself, this seems like a really useful usability improvement.

You could also build on top of this notion to model both branch 
prediction mispredicts and resource limitations.  (Or only one or the 
other.)  From the perspective of the predictor, the trace would be 
ground truth so you could analyze how well a particularly prediction 
strategy works on that particular trace.  Distinct from the prediction, 
you could model delay between branch execution and condition 
satisfaction to identify resource limits in the actual speculative 
execution.  (I'm very deliberately being vague here; it's been a long 
time since I looked into this aspect of things and don't remember the 
appropriate terminology.  Figured it was better to be vague than 
potentially misleading by getting it wrong.)

Philip

On 03/01/2018 09:22 AM, Andrea Di Biagio via llvm-dev
wrote:> Hi all,
>
> At Sony we developed an LLVM based performance analysis tool named 
> llvm-mca. We
> currently use it internally to statically measure the performance of 
> code, and
> to help triage potential problems with target scheduling models.  We 
> decided to
> post this RFC because we are interested in the feedback from the 
> community, and
> we also believe that other people might be interested in a tool like this.
>
> llvm-mca uses information which is already available in LLVM (e.g. 
> scheduling
> models) to statically measure the performance of machine code in a 
> specific cpu.
> Performance is measured in terms of throughput as well as processor 
> resource
> consumption.  The tool currently works for processors with an out-of-order
> backend, for which there is a scheduling model available in LLVM.
>
> The main goal of this tool is not just to predict the performance of 
> the code
> when run on the target, but also help with diagnosing potential 
> performance
> issues.
>
> Given an assembly code sequence, llvm-mca estimates the IPC 
> (instructions per
> cycle), as well as hardware resources pressure. The analysis and 
> reporting style
> were inspired by the IACA tool from Intel.
>
> The presence of long data dependency chains, as well as poor usage of 
> hardware
> resources may lead to bottlenecks in the back-end.  The tool is able 
> to generate
> a detailed report which should help with identifying and analyzing 
> sources of
> bottlenecks.
>
> Scheduling models are mostly used to compute instruction latencies, to 
> obtain
> read-advance information, and understand how processor resources are 
> used by
> instructions.  By design, the quality of the performance analysis 
> conducted by
> the tool is inevitably affected by the quality of the target 
> scheduling models.
>
> However, scheduling models intentionally do not describe all 
> processors details,
> since the goal is just to enable the scheduling of machine 
> instructions during
> compilation. That means, there are processor details which are not 
> important for
> the purpose of scheduling instructions (and therefore not described by the
> scheduling model), but are very important for this tool.
>
> A few examples of details that are missing in scheduling models are:
>  - Maximum number of instructions retired per cycle.
>  - Actual dispatch width (it often differs from the issue width).
>  - Number of temporary registers available for renaming.
>  - Number of read/write ports in the register file(s).
>  - Length of the load/store queue in the LSUnit.
>
> It is also very difficult to find a "good" abstract model to
describe the
> behavior of out-of-order processors. So, we have to keep in mind that 
> all of
> these aspects are going to affect the quality of the static analysis 
> performed
> by the tool.
>
> An extensive list of known limitations is reported in one of the last 
> sections
> of this document. There is also a section related to design problems 
> which must
> be addressed (hopefully with the help of the community).  At the 
> moment, the
> tool has been mostly tested for x86 targets, but there are still several
> limitations, some of which could be overcome by integrating extra 
> information
> into the scheduling models.
>
> As was mentioned before, this tool has been (and is still being) used 
> internally
> in Sony to debug/triage issues in the btver2 scheduling model. We have 
> also
> tested it on other targets to check how generic the tool is. In our 
> experience,
> the tool makes it easy to identify simple mistakes like "wrong number 
> of micro
> opcodes specified for an instruction", or "wrong set of hardware 
> resources".
> Some of these mistakes are quite common (sometimes on mature models 
> too), and
> often difficult to catch.  Reports generated by this tool are simple 
> to analyze,
> and contain enough details to help triage most performance problems.
>
> 1. How the tool works
> ---------------------
>
> The tool takes assembly code as input. Assembly code is parsed into a 
> sequence
> of MCInst with the help of the existing LLVM target assembly parsers. 
> The parsed
> sequence of MCInst is then analyzed by a 'Backend' module to
generate a
> performance report.
>
> The Backend module internally emulates the execution of the machine code
> sequence in a loop of iterations (which by default is 70). At the end 
> of this
> process, the backend collects a number of statistics which are then 
> printed out
> in the form of a report.
>
> Here is an example of performance report generated by the tool for a 
> dot-product
> of two packed float vectors of four elements. The analysis is 
> conducted for
> target x86, cpu btver2:
>
> ///////////////////
>
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Resources:
> [0] - JALU0
> [1] - JALU1
> [2] - JDiv
> [3] - JFPM
> [4] - JFPU0
> [5] - JFPU1
> [6] - JLAGU
> [7] - JSAGU
> [8] - JSTC
> [9] - JVIMUL
>
>
> Resource pressure per iteration:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8] [9]
>  -      -      -      -     2.00   1.00    -      -      - -
>
> Resource pressure by instruction:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8] [9]        
> Instructions:
>  -      -      -      -      -     1.00    -      -      - -         
> vmulps    %xmm0, %xmm1, %xmm2
>  -      -      -      -     1.00    -      -      -      - -         
> vhaddps    %xmm2, %xmm2, %xmm3
>  -      -      -      -     1.00    -      -      -      - -         
> vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Instruction Info:
> [1]: #uOps
> [2]: Latency
> [3]: RThroughput
> [4]: MayLoad
> [5]: MayStore
> [6]: HasSideEffects
>
> [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
>  1      2     1.00                        vmulps    %xmm0, %xmm1, %xmm2
>  1      3     1.00                        vhaddps    %xmm2, %xmm2, %xmm3
>  1      3     1.00                        vhaddps    %xmm3, %xmm3, %xmm4
>
> ///////////////////
>
> According to this report, the dot-product kernel has been executed 300 
> times,
> for a total of 900 instructions dynamically executed.
>
> The report is structured in three main sections.  A first section 
> collects a few
> performance numbers; the goal of this section is to give a very quick 
> overview
> of the performance throughput. In this example, the two important 
> perforamce
> indicators are a) the predicted total number of cycles, and b) the IPC.
> IPC is probably the most important throughput indicator. A big delta 
> between the
> Dispatch Width and the computed IPC is an indicator of potential 
> performance
> issues.
>
> The second section is the so-called "resource pressure view".
This
> view reports
> the average number of resource cycles consumed every iteration by 
> instructions
> for every processor resource unit available on the target. Information is
> structured in two tables. The first table reports the number of 
> resource cycles
> spent on average every iteration. The second table correlates the resource
> cycles to the machine instruction in the sequence. For example, every 
> iteration
> of the dot-product, instruction 'vmulps' always executes on
resource
> unit [5]
> (JFPU1 - floating point pipeline #1), consuming an average of 1 
> resource cycle
> per iteration.  Note that on Jaguar, vector FP multiply can only be 
> issued to
> pipeline JFPU1, while horizontal FP adds can only be issued to 
> pipeline JFPU0.
>
> The third (and last) section of the report shows the latency and 
> reciprocal
> throughput of every instruction in the sequence. That section also 
> reports extra
> information related to the number of micro opcodes, and opcode 
> properties (i.e.
> 'MayLoad', 'MayStore' and 'UnmodeledSideEffects').
>
> The resource pressure view helps with identifying bottlenecks caused 
> by high
> usage of specific hardware resources.  Situations with resource 
> pressure mainly
> concentrated on a few resources should, in general, be avoided. Ideally,
> pressure should be uniformly distributed between multiple resources.
>
> Timeline View
> -------------
>
> A detailed report of each instruction's state transitions over time can
be
> enabled using the command line flag '-timeline'.  This prints an
extra
> section
> in the report which contains the so-called "timeline view". Below
is the
> timeline view for the dot-product example from the previous section.
>
> ///////////////
> Timeline view:
>                    012345
> Index    0123456789
>
> [0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2
> [0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3
> [0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2
> [1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3
> [1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2
> [2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3
> [2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Average Wait times (based on the timeline view):
> [0]: Executions
> [1]: Average time spent waiting in a scheduler's queue
> [2]: Average time spent waiting in a scheduler's queue while ready
> [3]: Average time elapsed from WB until retire stage
>
>       [0]    [1]    [2]    [3]
> 0.     3     1.0    1.0    3.3      vmulps    %xmm0, %xmm1, %xmm2
> 1.     3     3.3    0.7    1.0      vhaddps    %xmm2, %xmm2, %xmm3
> 2.     3     5.7    0.0    0.0      vhaddps    %xmm3, %xmm3, %xmm4
> ///////////////
>
> The timeline view is very interesting because it shows how 
> instructions changed
> in state during execution.  It also gives an idea of how the tool
"sees"
> instructions executed on the target.
>
> The timeline view is structured in two tables.  The first table shows how
> instructions change in state over time (measured in cycles); the 
> second table
> (named "Average Wait times") reports useful timing statistics
which
> should help
> diagnose performance bottlenecks caused by long data dependencies and 
> sub-optimal
> usage of hardware resources.
>
> An instruction in the timeline view is identified by a pair of 
> indices, where
> the 'first' index identifies an iteration, and the 'second'
index is
> the actual
> instruction index (i.e. where it appears in the code sequence).
>
> Excluding the first and last column, the remaining columns are in 
> cycles.  Cycles
> are numbered sequentially starting from 0.  The following characters 
> are used to
> describe the state of an instruction:
>
>  D : Instruction dispatched.
>  e : Instruction executing.
>  E : Instruction executed.
>  R : Instruction retired.
>  = : Instruction already dispatched, waiting to be executed.
>  - : Instruction executed, waiting to be retired.
>
> Based on the timeline view from the example, we know that:
>   - Instruction [1, 0] was dispatched at cycle 1.
>   - Instruction [1, 0] started executing at cycle 2.
>   - Instruction [1, 0] reached the write back stage at cycle 4.
>   - Instruction [1, 0] was retired at cycle 10.
>
> Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't have to 
> wait in
> the scheduler's queue for the operands to become available. By the 
> time the
> vmulps is dispatched, operands are already available, and pipeline 
> JFPU1 is
> ready to serve another instruction.  So the instruction can be immediately
> issued on the JFPU1 pipeline. That is demonstrated by the fact that the
> instruction only spent 1cy in the scheduler's queue.
>
> There is a gap of 5 cycles between the write-back stage and the retire 
> event.
> That is because instructions must retire in program order, so [1,0] 
> has to wait
> for [0, 2] to be retired first (i.e it has to wait unti cycle 10).
>
> In the dot-product example, all instructions are in a RAW (Read After 
> Write)
> dependency chain.  Register %xmm2 written by the vmulps is immediately 
> used by
> the first vhaddps, and register %xmm3 written by the first vhaddps is 
> used by
> the second vhaddps.  Long data dependencies negatively affect the ILP
> (Instruction Level Parallelism).
>
> In the dot-product example, there are anti-dependencies introduced by
> instructions from different iterations.  However, those dependencies 
> can be
> removed at register renaming stage (at the cost of allocating register 
> aliases,
> and therefore consuming temporary registers).
>
> Table "Average Wait times" helps diagnose performance issues that
are
> caused by
> the presence of long latency instructions and potentially long data 
> dependencies
> which may limit the ILP.  Note that the tool by default assumes at 
> least 1cy
> between the dispatch event and the issue event.
>
> When the performance is limited by data dependencies and/or long latency
> instructions, the number of cycles spent while in the "ready"
state is
> expected
> to be very small when compared with the total number of cycles spent 
> in the
> scheduler's queue.  So the difference between the two counters is a
good
> indicator of how big of an impact data dependencies had on the 
> execution of
> instructions.  When performance is mostly limited by the lack of hardware
> resources, the delta between the two counters is small. However, the 
> number of
> cycles spent in the queue tends to be bigger (i.e. more than 1-3cy) 
> especially
> when compared with other low latency instructions.
>
> Extra statistics to further diagnose performance issues.
> --------------------------------------------------------
>
> Flag '-verbose' enables extra statistics and performance counters
for the
> dispatch logic, the reorder buffer, the retire control unit and the 
> register
> file.
>
> Below is an example of verbose output generated by the tool for the 
> dot-product
> example discussed in the previous sections.
>
> ///////////////////
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Dynamic Dispatch Stall Cycles:
> RAT     - Register unavailable:                      0
> RCU     - Retire tokens unavailable:                 0
> SCHEDQ  - Scheduler full:                            272
> LQ      - Load queue full:                           0
> SQ      - Store queue full:                          0
> GROUP   - Static restrictions on the dispatch group: 0
>
>
> Register Alias Table:
> Total number of mappings created: 900
> Max number of mappings used:      35
>
>
> Dispatch Logic - number of cycles where we saw N instructions dispatched:
> [# dispatched], [# cycles]
>  0,              24  (3.9%)
>  1,              272  (44.6%)
>  2,              314  (51.5%)
>
>
> Schedulers - number of cycles where we saw N instructions issued:
> [# issued], [# cycles]
>  0,          7  (1.1%)
>  1,          306  (50.2%)
>  2,          297  (48.7%)
>
>
> Retire Control Unit - number of cycles where we saw N instructions 
> retired:
> [# retired], [# cycles]
>  0,           109  (17.9%)
>  1,           102  (16.7%)
>  2,           399  (65.4%)
>
>
> Scheduler's queue usage:
> JALU01,  0/20
> JFPU01,  18/18
> JLSAGU,  0/12
> ///////////////////
>
> Based on the verbose report, the backend was only able to dispatch two
> instructions 51.5% of the time.  The dispatch group was limited to one
> instruction 44.6% of the cycles, which corresponds to 272 cycles.
>
> If we look at section "Dynamic Dispatch Stall Cycles", we can see
how
> counter
> SCHEDQ reports 272 cycles.  Counter SCHEDQ is incremented every time 
> the dispatch
> logic is unable to dispatch a full group of two instructions because the
> scheduler's queue is full.
>
> Section "Scheduler's queue usage" shows how the maximum
number of
> buffer entries
> (i.e. scheduler's queue entries) used at runtime for resource JFPU01 
> reached its
> maximum. Note that AMD Jaguar implements three schedulers:
>   * JALU01 - A scheduler for ALU instructions
>   * JLSAGU - A scheduler for address generation
>   * JFPU01 - A scheduler floating point operations.
>
> The dot-product is a kernel of three floating point instructions (a vector
> multiply followed by two horizontal adds).  That explains why only the 
> floating
> point scheduler appears to be used according to section
"Scheduler's queue
> usage".
>
> A full scheduler's queue is either caused by data dependency chains, 
> or by a
> sub-optimal usage of hardware resources.  Sometimes, resource pressure 
> can be
> mitigated by rewriting the kernel using different instructions that 
> consume
> different scheduler resources.  Schedulers with a small queue are less 
> resilient
> to bottlenecks caused by the presence of long data dependencies.
>
> In this example, we can conclude that the IPC is mostly limited by data
> dependencies, and not by resource pressure.
>
> LLVM-MCA instruction flow
> -------------------------
>
> This section describes the instruction flow through the out-of-order 
> backend, as
> well as the functional units involved in the process.
>
> An instruction goes through a default sequence of stages:
>     - Dispatch (Instruction is dispatched to the schedulers).
>     - Issue (Instruction is issued to the processor pipelines).
>     - Write Back (Instruction is executed, and results are written back).
>     - Retire (Instruction is retired; writes are architecturally 
> committed).
>
> The tool only models the out-of-order portion of a processor. 
> Therefore, the
> instruction fetch and decode stages are not modeled. Performance 
> bottlenecks in
> the frontend are not diagnosed by this tool.  The tool assumes that 
> instructions
> have all been decoded and placed in a queue.  Also, the tool doesn't
know
> anything about branch prediction.
>
> Instruction Dispatch
> --------------------
>
> During the Dispatch stage, instructions are picked in program order 
> from a queue
> of already decoded instructions, and dispatched in groups to the hardware
> schedulers.  The dispatch logic is implemented by class DispatchUnit 
> in file
> Dispatch.h.
>
> The size of a dispatch group depends on the availability of hardware 
> resources,
> and it cannot exceed the value of field 'DispatchWidth' in class 
> DispatchUnit.
> Note that field DispatchWidth defaults to the value of field 
> 'IssueWidth' from
> the scheduling model.
>
> Users can override the DispatchWidth value with flag
"-dispatch=<N>"
> (where 'N'
> is an unsigned quantity).
>
> An instruction can be dispatched if:
>  - The size of the dispatch group is smaller than DispatchWidth
>  - There are enough entries in the reorder buffer
>  - There are enough temporary registers to do register renaming
>  - Schedulers are not full.
>
> Scheduling models don't describe register files, and therefore the 
> tool doesn't
> know if there is more than one register file, and how many temporaries are
> available for register renaming.
>
> By default, the tool (optimistically) assumes a single register file 
> with an
> unbounded number of temporary registers.  Users can limit the number of
> temporary registers available for register renaming using flag
> `-register-file-size=<N>`, where N is the number of temporaries.  A 
> value of
> zero for N means 'unbounded'.  Knowing how many temporaries are 
> available for
> register renaming, the tool can predict dispatch stalls caused by the 
> lack of
> temporaries.
>
> The number of reorder buffer entries consumed by an instruction 
> depends on the
> number of micro-opcodes it specifies in the target scheduling model 
> (see field
> 'NumMicroOpcodes' of tablegen class ProcWriteResources and its
derived
> classes;
> TargetSchedule.td).
>
> The reorder buffer is implemented by class RetireControlUnit (see 
> Dispatch.h).
> Its goal is to track the progress of instructions that are 
> "in-flight", and
> retire instructions in program order.  The number of entries in the 
> reorder
> buffer defaults to the value of field 'MicroOpBufferSize' from the
target
> scheduling model.
>
> Instructions that are dispatched to the schedulers consume scheduler 
> buffer
> entries.  The tool queries the scheduling model to figure out the set of
> buffered resources consumed by an instruction.  Buffered resources are 
> treated
> like "scheduler" resources, and the field 'BufferSize'
(from the processor
> resource tablegen definition) defines the size of the scheduler's
queue.
>
> Zero latency instructions (for example NOP instructions) don't consume 
> scheduler
> resources.  However, those instructions still reserve a number of 
> slots in the
> reorder buffer.
>
> Instruction Issue
> -----------------
>
> As mentioned in the previous section, each scheduler resource 
> implements a queue
> of instructions.  An instruction has to wait in the scheduler's queue 
> until
> input register operands become available.  Only at that point, does the
> instruction becomes eligible for execution and may be issued (potentially
> out-of-order) to a pipeline for execution.
>
> Instruction latencies can be computed by the tool with the help of the
> scheduling model; latency values are defined by the scheduling model 
> through
> ProcWriteResources objects.
>
> Class Scheduler (see file Scheduler.h) knows how to emulate multiple 
> processor
> schedulers.  A Scheduler is responsible for tracking data 
> dependencies, and
> dynamically select which processor resources are consumed/used by 
> instructions.
>
> Internally, the Scheduler class delegates the management of processor 
> resource
> units and resource groups to the ResourceManager class. 
> ResourceManager is also
> responsible for selecting resource units that are effectively consumed by
> instructions.  For example, if an instruction consumes 1cy of a 
> resource group,
> the ResourceManager object selects one of the available units from the 
> group; by
> default, it uses a round-robin selector to guarantee that resource 
> usage is
> uniformly distributed between all units of a group.
>
> Internally, class Scheduler implements three instruction queues:
>   - WaitQueue: a queue of instructions whose operands are not ready yet.
>   - ReadyQueue: a queue of instructions ready to execute.
>   - IssuedQueue: a queue of instructions executing.
>
> Depending on the operands availability, instructions that are 
> dispatched to the
> Scheduler are either placed into the WaitQueue or into the ReadyQueue.
>
> Every cycle, class Scheduler checks if instructions can be moved from the
> WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue 
> can be
> issued to the underlying pipelines.  The algorithm prioritizes older
> instructions over younger instructions.
>
> Objects of class ResourceState (see Scheduler.h) describe processor 
> resources.
> There is an instance of class ResourceState for each single processor 
> resource
> specified by the scheduling model.  A ResourceState object for a processor
> resource with multiple units dynamically tracks the availability of 
> every single
> unit.  For example, the ResourceState of a resource group tracks the
> availability of every resource in that group.  Internally, ResourceState
> implements a round-robin selector to dynamically pick the next unit to 
> use from
> the group.
>
> Write-Back and Retire Stage
> ---------------------------
>
> Issued instructions are moved from the ReadyQueue to the IssuedQueue.  
> There,
> instructions wait until they reach the write-back stage.  At that 
> point, they
> get removed from the queue and the retire control unit is notified.
>
> On the event of "instruction executed", the retire control unit
flags the
> instruction as "ready to retire".
>
> Instruction are retired in program order; an "instruction
retired"
> event is sent
> to the register file which frees the temporary registers allocated for the
> instruction at register renaming stage.
>
> Load/Store Unit and Memory Consistency Model
> --------------------------------------------
>
> The tool attempts to emulate out-of-order execution of memory 
> operations.  Class
> LSUnit (see file LSUnit.h) emulates a load/store unit implementing 
> queues for
> speculative execution of loads and stores.
>
> Each load (or store) consumes an entry in the load (or store) queue.  
> The number
> of slots in the load/store queues is unknown by the tool, since there 
> is no
> mention of it in the scheduling model.  In practice, users can specify 
> flag
> `-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in the 
> queue to be
> equal to exactly N (an unsigned value). If N is zero, then the tool 
> assumes an
> unbounded queue (this is the default).
>
> LSUnit implements a relaxed consistency model for memory loads and 
> stores. The
> rules are:
> 1) A younger load is allowed to pass an older load only if there is no
>    intervening store in between the two loads.
> 2) An younger store is not allowed to pass an older store.
> 3) A younger store is not allowed to pass an older load.
> 4) A younger load is allowed to pass an older store provided that the 
> load does
>    not alias with the store.
>
> By default, this class conservatively (i.e. pessimistically) assumes 
> that loads
> always may-alias store operations.  Essentially, this LSUnit doesn't 
> perform any
> sort of alias analysis to rule out cases where loads and stores don't 
> overlap
> with each other.  The downside of this approach however is that 
> younger loads are
> never allowed to pass older stores.  To make it possible for a younger 
> load to
> pass an older store, users can use the command line flag -noalias.  Under
> 'noalias', a younger load is always allowed to pass an older store.
>
> Note that, in the case of write-combining memory, rule 2. could be 
> relaxed a bit
> to allow reordering of non-aliasing store operations.  That being 
> said, at the
> moment, there is no way to further relax the memory model (flag 
> -noalias is the
> only option).  Essentially, there is no option to specify a different 
> memory
> type (for example: write-back, write-combining, write-through; etc.) and
> consequently to weaken or strengthen the memory model.
>
> Other limitations are:
>  * LSUnit doesn't know when store-to-load forwarding may occur.
>  * LSUnit doesn't know anything about the cache hierarchy and memory 
> types.
>  * LSUnit doesn't know how to identify serializing operations and 
> memory fences.
>
> No assumption is made on the store buffer size.  As mentioned before, 
> LSUnit
> conservatively assumes a may-alias relation between loads and stores, 
> and it
> doesn't attempt to identify cases where store-to-load forwarding would 
> occur in
> practice.
>
> LSUnit doesn't attempt to predict whether a load or store hits or 
> misses the L1
> cache.  It only knows if an instruction "MayLoad" and/or
"MayStore". 
> For loads,
> the scheduling model provides an "optimistic" load-to-use latency
> (which usually
> matches the load-to-use latency for when there is a hit in the L1D).
>
> Class MCInstrDesc in LLVM doesn't know about serializing operations,
nor
> memory-barrier like instructions.  LSUnit conservatively assumes that an
> instruction which has both 'MayLoad' and
'UnmodeledSideEffects'
> behaves like a
> "soft" load-barrier.  That means, it serializes loads without
forcing
> a flush of
> the load queue.  Similarly, instructions flagged with both
'MayStore' and
> 'UnmodeledSideEffects' are treated like store barriers.  A full
memory
> barrier
> is a 'MayLoad' and 'MayStore' instruction with 
> 'UnmodeledSideEffects'.  This is
> inaccurate, but it is the best that we can do at the moment with the 
> current
> information available in LLVM.
>
> A load/store barrier consumes one entry of the load/store queue.  A 
> load/store
> barrier enforces ordering of loads/stores.  A younger load cannot pass 
> a load
> barrier.  Also, a younger store cannot pass a store barrier.  A 
> younger load has
> to wait for the memory/load barrier to execute.  A load/store barrier is
> "executed" when it becomes the oldest entry in the load/store 
> queue(s). That
> also means, by construction, all the older loads/stores have been 
> executed.
>
> In conclusion the full set of rules is:
>   1. A store may not pass a previous store.
>   2. A load may not pass a previous store unless flag 'NoAlias' is
set.
>   3. A load may pass a previous load.
>   4. A store may not pass a previous load (regardless of flag
'NoAlias').
>   5. A load has to wait until an older load barrier is fully executed.
>   6. A store has to wait until an older store barrier is fully executed.
>
> Known limitations
> -----------------
> Previous sections described cases where the tool is missing 
> information to give
> an accurate report.  For example, the first sections of this document 
> explained
> how the lack of knowledge about the processor negatively affects the 
> performance
> analysis.  The lack of knowledge is often a consequence of how 
> scheduling models
> are defined; as mentioned before, scheduling models intentionally 
> don't describe
> processors in fine details.
>
> The accuracy of the performance analysis is also affected by 
> assumptions made by
> the processor model used by the tool.
>
> Most recent Intel and AMD processors implement dedicated 
> LoopBuffer/OpCache in
> the hardware frontend to speedup the throughput in the presence of 
> tight loops.
> The presence of these buffers complicates the decoding logic, and requires
> knowledge on the branch predictor too.  Class 'SchedMachineModel'
in
> tablegen
> provides a field named 'LoopMicroOpBufferSize' which is used to 
> describe loop
> buffers.  However, the purpose of that field is to enable loop 
> unrolling of
> tight loops; essentially, it affects the cost model used by pass 
> loop-unroll.
>
> By design, the tool only cares about the out-of-order portion of a 
> processor,
> and consequently doesn't try to predict the frontend throughput.  
> Processors may
> implement complex decoding schemes; statically predicting the frontend
> throughput is in general beyond the scope of this tool.  For the same 
> reasons,
> this tool intentionally doesn't model branch prediction.  That being 
> said, this
> tool could be definitely extended in future to also account for the 
> hardware
> frontend when doing performance analysis.  This would inevitably 
> require extra
> (extensive) processor knowledge related to all the available decoding 
> paths in
> the hardware frontend.
>
> When computing the IPC, the tool assumes a zero-latency "perfect"
> fetch&decode
> stage; the full sequence of decoded instructions is immediately 
> visible to the
> dispatch logic from the start.
>
> The tool doesn't know about simultaneous mutithreading. According to 
> the tool,
> processor resources are not statically/dynamically partitioned. Processor
> resources are fully available to the hardware thread executing the
> microbenchmark.
>
> The execution model implemented by this tool assumes that instructions are
> firstly dispatched in groups to hardware schedulers, and then issued to
> pipelines for execution.  The model assumes dynamic scheduling of 
> instructions.
> Instructions are placed in a queue and potentially executed 
> out-of-order (based
> on the operand availability). The dispatch stage is definitely 
> distinct from the
> issue stage.
>
> This model doesn't correctly describe processors where the 
> dispatch/issue is a
> single stage. This is what happens for example in VLIW processors, where
> instructions are packaged and statically scheduled at compile time; it 
> is up to
> the compiler to predict the latency of instructions and package issue 
> groups
> accordingly. For such targets, there is no dynamic scheduling done by the
> hardware.
>
> Existing classes (DispatchUnit, Scheduler, etc.) could be 
> extended/adapted to
> support processors with a single dispatch/issue stage. The execution 
> flow would
> require some changes in the way how existing components (i.e. 
> DispatchUnit,
> Scheduler, etc.) interact. This can be a future development.
>
> The following sections describes other known limitations.  The goal is 
> not to
> provide an extensive list of limitations; we want to report what we 
> believe are
> the most important limitations, and suggest possible methods to 
> overcome them.
>
> Load/Store barrier instructions and serializing operations
> ----------------------------------------------------------
> Section "Load/Store Unit and Memory Consistency Model" already 
> mentioned how
> LLVM doesn't know about serializing operations and memory barriers.  
> Most of it
> boils down to the fact that class MCInstrDesc (intentionally) doesn't 
> expose
> those properties.  Instead, both serializing operations and memory 
> barriers
> "have side-effects" according to MCInstrDesc.  That is because,
at
> least for
> scheduling purposes, knowing that an instruction has unmodeled side 
> effects is
> often enough to treat the instruction like a compiler scheduling barrier.
>
> A performance analysis tool could use the extra knowledge on barriers and
> serializing operations to generate a more accurate performance 
> report.  One way
> to improve this is by reserving a couple of bits in field 'Flags'
from
> class
> MCInstrDesc: one bit for barrier operations, and another bit to mark
> instructions as serializing operations.
>
> Lack of support for instruction itineraries
> -------------------------------------------
> The current version of the tool doesn't know how to process instruction
> itineraries.  This is probably one of the most important limitations, 
> since it
> affects a few out-of-order processors in LLVM.
>
> As mentioned in section 'Instruction Issue', class Scheduler
delegates
> to an
> instance of class ResourceManager the handling of processor resources.
> ResourceManager is where most of the scheduling logic is implemented.
>
> Adding support for instruction itineraries requires that we teach
> ResourceManager how to handle functional units and instruction 
> stages.  This
> development can be a future extension, and it would probably require a few
> changes to the ResourceManager interface.
>
> Instructions that affect control flow are not correctly modeled
> ---------------------------------------------------------------
> Examples of instructions that affect the control flow are: return, 
> indirect
> branches, calls, etc.  The tool doesn't try to predict/evaluate branch 
> targets.
> In particular, the tool doesn't model any sort of branch prediction, 
> nor does it
> attempt to track changes to the program counter.  The tool always 
> assumes that
> the input assembly sequence is the body of a microbenchmark (a simple loop
> executed for a number of iterations). The "next" instruction in 
> sequence is
> always the next instruction to dispatch.
>
> Call instructions default to an arbitrary high latency of 100cy. A 
> warning is
> generated if the tool encounters a call instruction in the sequence.  
> Return
> instructions are not evaluated, and therefore control flow is not 
> affected.
> However, the tool still queries the processor scheduling model to 
> obtain latency
> information for instructions that affect the control flow.
>
> Possible extensions to the scheduling model
> -------------------------------------------
> Section "Instruction Dispatch" explained how the tool doesn't
know
> about the
> register files, and temporaries available in each register file for 
> register
> renaming purposes.
>
> The LLVM scheduling model could be extended to better describe 
> register files.
> Ideally, scheduling model should be able to define:
>  - The size of each register file
>  - How many temporary registers are available for register renaming
>  - How register classes map to register files
>
> The scheduling model doesn't specify the retire throughput (i.e. how
many
> instructions can be retired every cycle).  Users can specify flag
> `-max-retire-per-cycle=<uint>` to limit how many instructions the 
> retire control
> unit can retire every cycle.  Ideally, every processor should be able 
> to specify
> the retire throughput (for example, by adding an extra field to the 
> scheduling
> model tablegen class).
>
> Known limitations on X86 processors
> -----------------------------------
>
> 1) Partial register updates versus full register updates.
>
> On x86-64, a 32-bit GPR write fully updates the super-register. Example:
>       add %edi %eax    ## eax += edi
>
> Here, register %eax aliases the lower half of 64-bit register %rax. On 
> x86-64,
> register %rax is fully updated by the 'add' (the upper half of %rax
is
> zeroed).
> Essentially, it "kills" any previous definition of (the upper
half of)
> register
> %rax.
>
> On the other hand, 8/16 bit register writes only perform a so-called 
> "partial
> register update". Example:
>       add %di, %ax     ## ax += di
>
> Here, register %eax is only partially updated. To be more specific, 
> the lower
> half of %eax is set, and the upper half is left unchanged. There is 
> also no
> change in the upper 48 bits of register %rax.
>
> To get accurate performance analysis, the tool has to know which 
> instructions
> perform a partial register update, and which instructions fully update the
> destination's super-register.
>
> One way to expose this information is (again) via tablegen.  For 
> example, we
> could add a flag in the tablegen instruction class to tag instructions 
> that
> perform partial register updates. Something like this: 'bit
> hasPartialRegisterUpdate = 1'. However, this would force a `let
> hasPartialRegisterUpdate = 0` on several instruction definitions.
>
> Another approach is to have a MCSubtargetInfo hook similar to this:
>     virtual bool updatesSuperRegisters(unsigned short opcode) { return 
> false; }
>
> Targets will be able to override this method if needed.  Again, this 
> is just an
> idea. But the plan is to have this fixed as a future development.
>
> 2) Macro Op fusion.
>
> The tool doesn't know about macro-op fusion. On modern x86 processors,
a
> 'cmp/test' followed by a 'jmp' is fused into a single macro
> operation.  The
> advantage is that the fused pair only consumes a single slot in the 
> dispatch
> group.
>
> As a future development, the tool should be extended to address 
> macro-fusion.
> Ideally, we could have LLVM generate a table enumerating all the 
> opcode pairs
> that can be fused together. That table could be exposed to the tool 
> via the
> MCSubtargetInfo interface.  This is just an idea; there may be better 
> ways to
> implement this.
>
> 3) Intel processors: mixing legacy SSE with AVX instructions.
>
> On modern Intel processors with AVX, mixing legacy SSE code with AVX code
> negatively impacts the performance.  The tool is not aware of this 
> issue, and
> the performance penalty is not accounted when doing the analysis. This is
> something that we would like to improve in future.
>
> 4) Zero-latency register moves and Zero-idioms.
>
> Most modern AMD/Intel processors know how to optimize out 
> register-register
> moves and zero idioms at register renaming stage. The tool doesn't know
> about these patterns, and this may negatively impact the performance 
> analysis.
>
> Known design problems
> ---------------------
> This section describes two design issues that are currently affecting 
> the tool.
> The long term plan is to "fix" these issues.
>
> 1) Variant instructions not correctly modeled.
>
> The tool doesn't know how to analyze instructions with a
"variant"
> scheduling
> class descriptor. A variant scheduling class needs to be resolved 
> dynamically.
> The "actual" scheduling class often depends on the subtarget, as
well as
> properties of the specific MachineInstr object.
>
> Unfortunately, the tool manipulates MCInst, and it doesn't know 
> anything about
> MachineInstr. As a consequence, the tool cannot use the existing machine
> subtarget hooks that are normally used to resolve the variant 
> scheduling class.
> This is a major design issue which mostly affects ARM/AArch64 targets.  It
> mostly boils down to the fact that the existing scheduling framework 
> was meant
> to work for MachineInstr.
>
> When the tool encounters a "variant" instruction, it assumes a
generic 1cy
> latency. However, the tool would not be able to tell which processor 
> resources
> are effectively consumed by the variant instruction.
>
> 2) MCInst and MCInstrDesc.
>
> Performance analysis tools require data dependency information to 
> correctly
> predict the runtime performance of the code. This tool must always be 
> able to
> obtain the set of implicit/explicit register defs/uses for every 
> instruction of
> the input assembly sequence.
>
> In the first section of this document, it was mentioned how the tool 
> takes as
> input an assembly sequence. That sequence is parsed into a MCInst 
> sequence with
> the help of assembly parsers available from the targets.
>
> A MCInst is a very low-level instruction representation. The tool can 
> inspect
> the MCOperand sequence of an MCInst to identify register operands. 
> However,
> there is no way to tell register operands that are definitions from 
> register
> operands that are uses.
>
> In LLVM, class MCInstrDesc is used to fully describe target 
> instructions and
> their operands. The opcode of a machine instruction (a MachineInstr 
> object) can
> be used to query the instruction set through method `MCInstrInfo::get' 
> to obtain
> the associated MCInstrDesc object.
>
> However class MCInstrDesc describes properties and operands of 
> MachineInstr
> objects. Essentially, MCInstrDesc is not meant to be used to describe 
> MCInst
> objects.  To be more specific, MCInstrDesc objects are automatically 
> generated
> via tablegen from the instruction set description in the target .td 
> files.  For
> example, field `MCInstrDesc::NumDefs' is always equal to the 
> cardinality of the
> `(outs)` set from the tablegen instruction definition.
>
> By construction, register definitions always appear at the beginning 
> of the
> MachineOperands list in MachineInstr. Basically, the (outs) are the first
> operands of a MachineInstr, and the (ins) will come after in the 
> machine operand
> list. Knowing the number of register definitions is enough to identify
> all the register operands that are definitions.
>
> In a normal compilation process, MCInst objects are generated from 
> MachineInstr
> objects through a lowering step. By default the lowering logic simply 
> iterates
> over the machine operands of a MachineInstr, and converts/expands them 
> into
> equivalent MCOperand objects.
>
> The default lowering strategy has the advantage of preserving all of the
> above mentioned assumptions on the machine operand sequence. That 
> means, register
> definitions would still be at the beginning of the MCOperand sequence, and
> register uses would come after.
>
> Targets may still define custom lowering routines for specific 
> opcodes. Some of
> these routines may lower operands in a way that potentially breaks 
> (some of) the
> assumptions on the machine operand sequence which were valid for 
> MachineInstr.
> Luckily, this is not the most common form of lowering done by the 
> targets, and
> the vast majority of the MachineInstr are lowered based on the default 
> strategy
> which preserves the original machine operand sequence.  This is 
> especially true
> for x86, where the custom lowering logic always preserves the original 
> (i.e.
> from the MachineInstr) operand sequence.
>
> This tool currently works under the strong (and potentially incorrect)
> assumption that register def/uses in a MCInst can always be identified by
> querying the machine instruction descriptor for the opcode. This 
> assumption made
> it possible to develop this tool and get good numbers at least for the
> processors available in the x86 backend.
>
> That being said, the analysis is still potentially incorrect for other 
> targets.
> So we plan (with the help of the community) to find a proper mechanism 
> to map
> when possible MCOperand indices back to MachineOperand indices of the 
> equivalent
> MachineInstr.  This would be equivalent to describing changes made by the
> lowering step which affected the operand sequence. For example, we 
> could have an
> index for every register MCOperand (or -1, if the operand didn't exist 
> in the
> original MachineInstr). The mapping could look like this <0,1,3,2>. 
Here,
> MCOperand #2 was obtained from the lowering of MachineOperand #3. etc.
>
> This information could be automatically generated via tablegen for all the
> instructions whose custom lowering step breaks assumptions made by the 
> tool on
> the register operand sequence (In general, these instructions should 
> be the
> minority of a target's instruction set). Unfortunately, we don't
have that
> information now.  As a consequence, we assume that the number of explicit
> register definitions is the same number specified in MCInstrDesc.  We also
> assume that register definitions always come first in the operand 
> sequence.
>
> In conclusion: these are for now the strong assumptions made by the tool:
>   * The number of explicit and implicit register definitions in a MCInst
>     matches the number of explicit and implicit definitions specified 
> by the
>     MCInstrDesc object.
>   * Register uses always come after register definitions.
>   * If an opcode specifies an optional definition, then the optional
>     definition is always the last register operand in the sequence.
>
> Note that some of the information accessible from the MCInstrDesc is 
> always
> valid for MCInst. For example: implicit register defs, implicit 
> register uses
> and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties
still
> apply to
> MCInst. The tool knows about this, and uses that information during its
> analysis.
>
> What to do next
> ---------------
> The source code has been uploaded for review on phabricator at this 
> link: https://reviews.llvm.org/D43951.
>
> The review covers two patches:
> A first (very small) patch that always enables the generation of processor
> resource names in the SubtargetEmitter. Currently, the processor 
> resource names
> are only generated for debugging purposes, but are needed by the tool to
> generate user friendly reports, so we would like to always generate them.
> A second patch with the actual static analysis tool (in llvm/tools).
>
> Once these first two patches are committed, the plan is to keep 
> working on the
> tool with the help of the community to address all of the limitations 
> described
> by the previous sections, and find good solutions/fixes for the design 
> issues
> described by section "Known design problems".
>
> We hope the community will find this tool useful like we have.
>
> Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell 
> who really
> helped me a lot by suggesting improvements and testing the tool.
>
> Thanks for your time.
> -Andrea
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180302/a40b76be/attachment-0001.html>

Andrea Di Biagio via llvm-dev

2018-Mar-02 20:21 UTC

head link

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

Hi Philip,

On Fri, Mar 2, 2018 at 6:06 PM, Philip Reames <listmail at
philipreames.com>
wrote:
> Reading through this last night got be thinking about how to model control
> flow.  Given most of my source code tends to be very branchy, be limited to
> straight line code is quite restrictive.  The main thing is that it
> requires a lot of hand simplification which can be rather error prone at
> times.
>
> It occurs to me that we could remove the restriction around branches
> without necessarily fully modeling fetch, decode, or prediction.  If in
> addition to an assembly file, we also feed the tool a trace of branch
> executions, the tool could essentially unroll all loops dynamically.  The
> type of thing I'm thinking of would look like something like this:
(ENTRY,
> branch_dest, branch_dest, branch_dest) where each "branch_dest"
is the
> taken successor of the next encountered branch instruction.
>
> As an example, consider a simple loop which executes 3 times:
> ENTRY:
>   # fallthrough
> bb1:
>   # various instructions
>   jne bb1
> bb2:
>   ret
>
> The trace for this would look like: (ENTRY, bb1, bb1, bb1, FALLTHROUGH).
>
> The nice thing about such a branch trace is that it separate the source of
> the trace from the analysis tool.  You could provide an actual trace
> collected through instrumentation, or a predicted trace generated from
> sample profiling information.
>
I think so too. That would be an interesting extension.

Ideally, a trace could contain even more than just branch information. As
Andy wrote in his first comment " Combining the static model with a sampled
dynamic hardware event profile would be amazing.".  The idea of having a
branch trace is nice and it can be a start.

>
> (There are obvious extensions to handle indirect branches, calls, and such
> here.  I'm leaving that as an exercise for the reader at the moment.)
>
> By itself, this seems like a really useful usability improvement.
>
> You could also build on top of this notion to model both branch prediction
> mispredicts and resource limitations.  (Or only one or the other.)  From
> the perspective of the predictor, the trace would be ground truth so you
> could analyze how well a particularly prediction strategy works on that
> particular trace.  Distinct from the prediction, you could model delay
> between branch execution and condition satisfaction to identify resource
> limits in the actual speculative execution.  (I'm very deliberately
being
> vague here; it's been a long time since I looked into this aspect of
things
> and don't remember the appropriate terminology.  Figured it was better
to
> be vague than potentially misleading by getting it wrong.)
>

Makes sense. Thanks for the idea.

-Andrea

>
> Philip
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180302/310c06b5/attachment.html>

UE US via llvm-dev

2018-Mar-05 09:59 UTC

head link

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

Hello,

I just wanted to say I like this a lot.  I can think of at least three
projects I worked on in the past that could have used such a tool as a sort
of rough indicator even by itself but especially in combination with
performance testing on hardware as you mentioned.

Having worked on a project that went round trip from executable to MCInst &
analysis to executable again I don't see any problems with using that as
the layer for analyzing in this way.

I agree with your reasons for not modelling certain things like branch
prediction fully.   Once you get into things like that you have to start
thinking about loop alignment, speculative execution (how do we count the
penalty for executing code that won't run on a mispredict, or the benefit
of doing it on a correct one?), the list goes on.  Same with the SSE + AVX
thing, and similar situations like AVX-512 downclocking and such.  A simple
model for branches and loads / stores also seems to fit in with what AMD
has said about optimizing for Ryzen because of the prediction being based
on a hardware neural network;  just going on their documentation on that
one.

The important thing is that it seems like it would make testing for new
processor scheduling models nicer;  it seems like a good tool to include in
a full LLVM build.  It could even enhance the test suite with cases
basically like what you mentioned with the btver2 work that would allow for
a sort of spot check.  The biggest problem there is that microcode updates
are capable of changing things that might show up there, but that's getting
into more detail than might be necessary.

Thanks for the work on this guys,
-Gordon


On Thu, Mar 1, 2018 at 11:22 AM, Andrea Di Biagio via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi all,
>
> At Sony we developed an LLVM based performance analysis tool named
> llvm-mca. We
> currently use it internally to statically measure the performance of code,
> and
> to help triage potential problems with target scheduling models.  We
> decided to
> post this RFC because we are interested in the feedback from the
> community, and
> we also believe that other people might be interested in a tool like this.
>
> llvm-mca uses information which is already available in LLVM (e.g.
> scheduling
> models) to statically measure the performance of machine code in a
> specific cpu.
> Performance is measured in terms of throughput as well as processor
> resource
> consumption.  The tool currently works for processors with an out-of-order
> backend, for which there is a scheduling model available in LLVM.
>
> The main goal of this tool is not just to predict the performance of the
> code
> when run on the target, but also help with diagnosing potential performance
> issues.
>
> Given an assembly code sequence, llvm-mca estimates the IPC (instructions
> per
> cycle), as well as hardware resources pressure. The analysis and reporting
> style
> were inspired by the IACA tool from Intel.
>
> The presence of long data dependency chains, as well as poor usage of
> hardware
> resources may lead to bottlenecks in the back-end.  The tool is able to
> generate
> a detailed report which should help with identifying and analyzing sources
> of
> bottlenecks.
>
> Scheduling models are mostly used to compute instruction latencies, to
> obtain
> read-advance information, and understand how processor resources are used
> by
> instructions.  By design, the quality of the performance analysis
> conducted by
> the tool is inevitably affected by the quality of the target scheduling
> models.
>
> However, scheduling models intentionally do not describe all processors
> details,
> since the goal is just to enable the scheduling of machine instructions
> during
> compilation. That means, there are processor details which are not
> important for
> the purpose of scheduling instructions (and therefore not described by the
> scheduling model), but are very important for this tool.
>
> A few examples of details that are missing in scheduling models are:
>  - Maximum number of instructions retired per cycle.
>  - Actual dispatch width (it often differs from the issue width).
>  - Number of temporary registers available for renaming.
>  - Number of read/write ports in the register file(s).
>  - Length of the load/store queue in the LSUnit.
>
> It is also very difficult to find a "good" abstract model to
describe the
> behavior of out-of-order processors. So, we have to keep in mind that all
> of
> these aspects are going to affect the quality of the static analysis
> performed
> by the tool.
>
> An extensive list of known limitations is reported in one of the last
> sections
> of this document. There is also a section related to design problems which
> must
> be addressed (hopefully with the help of the community).  At the moment,
> the
> tool has been mostly tested for x86 targets, but there are still several
> limitations, some of which could be overcome by integrating extra
> information
> into the scheduling models.
>
> As was mentioned before, this tool has been (and is still being) used
> internally
> in Sony to debug/triage issues in the btver2 scheduling model. We have also
> tested it on other targets to check how generic the tool is. In our
> experience,
> the tool makes it easy to identify simple mistakes like "wrong number
of
> micro
> opcodes specified for an instruction", or "wrong set of hardware
> resources".
> Some of these mistakes are quite common (sometimes on mature models too),
> and
> often difficult to catch.  Reports generated by this tool are simple to
> analyze,
> and contain enough details to help triage most performance problems.
>
> 1. How the tool works
> ---------------------
>
> The tool takes assembly code as input. Assembly code is parsed into a
> sequence
> of MCInst with the help of the existing LLVM target assembly parsers. The
> parsed
> sequence of MCInst is then analyzed by a 'Backend' module to
generate a
> performance report.
>
> The Backend module internally emulates the execution of the machine code
> sequence in a loop of iterations (which by default is 70). At the end of
> this
> process, the backend collects a number of statistics which are then
> printed out
> in the form of a report.
>
> Here is an example of performance report generated by the tool for a
> dot-product
> of two packed float vectors of four elements. The analysis is conducted for
> target x86, cpu btver2:
>
> ///////////////////
>
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Resources:
> [0] - JALU0
> [1] - JALU1
> [2] - JDiv
> [3] - JFPM
> [4] - JFPU0
> [5] - JFPU1
> [6] - JLAGU
> [7] - JSAGU
> [8] - JSTC
> [9] - JVIMUL
>
>
> Resource pressure per iteration:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
>  -      -      -      -     2.00   1.00    -      -      -      -
>
> Resource pressure by instruction:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
> Instructions:
>  -      -      -      -      -     1.00    -      -      -      -
> vmulps    %xmm0, %xmm1, %xmm2
>  -      -      -      -     1.00    -      -      -      -      -
> vhaddps    %xmm2, %xmm2, %xmm3
>  -      -      -      -     1.00    -      -      -      -      -
> vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Instruction Info:
> [1]: #uOps
> [2]: Latency
> [3]: RThroughput
> [4]: MayLoad
> [5]: MayStore
> [6]: HasSideEffects
>
> [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
>  1      2     1.00                        vmulps    %xmm0, %xmm1, %xmm2
>  1      3     1.00                        vhaddps    %xmm2, %xmm2, %xmm3
>  1      3     1.00                        vhaddps    %xmm3, %xmm3, %xmm4
>
> ///////////////////
>
> According to this report, the dot-product kernel has been executed 300
> times,
> for a total of 900 instructions dynamically executed.
>
> The report is structured in three main sections.  A first section collects
> a few
> performance numbers; the goal of this section is to give a very quick
> overview
> of the performance throughput. In this example, the two important
> perforamce
> indicators are a) the predicted total number of cycles, and b) the IPC.
> IPC is probably the most important throughput indicator. A big delta
> between the
> Dispatch Width and the computed IPC is an indicator of potential
> performance
> issues.
>
> The second section is the so-called "resource pressure view". 
This view
> reports
> the average number of resource cycles consumed every iteration by
> instructions
> for every processor resource unit available on the target.  Information is
> structured in two tables. The first table reports the number of resource
> cycles
> spent on average every iteration. The second table correlates the resource
> cycles to the machine instruction in the sequence. For example, every
> iteration
> of the dot-product, instruction 'vmulps' always executes on
resource unit
> [5]
> (JFPU1 - floating point pipeline #1), consuming an average of 1 resource
> cycle
> per iteration.  Note that on Jaguar, vector FP multiply can only be issued
> to
> pipeline JFPU1, while horizontal FP adds can only be issued to pipeline
> JFPU0.
>
> The third (and last) section of the report shows the latency and reciprocal
> throughput of every instruction in the sequence. That section also reports
> extra
> information related to the number of micro opcodes, and opcode properties
> (i.e.
> 'MayLoad', 'MayStore' and 'UnmodeledSideEffects').
>
> The resource pressure view helps with identifying bottlenecks caused by
> high
> usage of specific hardware resources.  Situations with resource pressure
> mainly
> concentrated on a few resources should, in general, be avoided.  Ideally,
> pressure should be uniformly distributed between multiple resources.
>
> Timeline View
> -------------
>
> A detailed report of each instruction's state transitions over time can
be
> enabled using the command line flag '-timeline'.  This prints an
extra
> section
> in the report which contains the so-called "timeline view". 
Below is the
> timeline view for the dot-product example from the previous section.
>
> ///////////////
> Timeline view:
>                    012345
> Index    0123456789
>
> [0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2
> [0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3
> [0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2
> [1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3
> [1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2
> [2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3
> [2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Average Wait times (based on the timeline view):
> [0]: Executions
> [1]: Average time spent waiting in a scheduler's queue
> [2]: Average time spent waiting in a scheduler's queue while ready
> [3]: Average time elapsed from WB until retire stage
>
>       [0]    [1]    [2]    [3]
> 0.     3     1.0    1.0    3.3      vmulps    %xmm0, %xmm1, %xmm2
> 1.     3     3.3    0.7    1.0      vhaddps    %xmm2, %xmm2, %xmm3
> 2.     3     5.7    0.0    0.0      vhaddps    %xmm3, %xmm3, %xmm4
> ///////////////
>
> The timeline view is very interesting because it shows how instructions
> changed
> in state during execution.  It also gives an idea of how the tool
"sees"
> instructions executed on the target.
>
> The timeline view is structured in two tables.  The first table shows how
> instructions change in state over time (measured in cycles); the second
> table
> (named "Average Wait times") reports useful timing statistics
which should
> help
> diagnose performance bottlenecks caused by long data dependencies and
> sub-optimal
> usage of hardware resources.
>
> An instruction in the timeline view is identified by a pair of indices,
> where
> the 'first' index identifies an iteration, and the 'second'
index is the
> actual
> instruction index (i.e. where it appears in the code sequence).
>
> Excluding the first and last column, the remaining columns are in cycles.
> Cycles
> are numbered sequentially starting from 0.  The following characters are
> used to
> describe the state of an instruction:
>
>  D : Instruction dispatched.
>  e : Instruction executing.
>  E : Instruction executed.
>  R : Instruction retired.
>  = : Instruction already dispatched, waiting to be executed.
>  - : Instruction executed, waiting to be retired.
>
> Based on the timeline view from the example, we know that:
>   - Instruction [1, 0] was dispatched at cycle 1.
>   - Instruction [1, 0] started executing at cycle 2.
>   - Instruction [1, 0] reached the write back stage at cycle 4.
>   - Instruction [1, 0] was retired at cycle 10.
>
> Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't have to
> wait in
> the scheduler's queue for the operands to become available. By the time
the
> vmulps is dispatched, operands are already available, and pipeline JFPU1 is
> ready to serve another instruction.  So the instruction can be immediately
> issued on the JFPU1 pipeline. That is demonstrated by the fact that the
> instruction only spent 1cy in the scheduler's queue.
>
> There is a gap of 5 cycles between the write-back stage and the retire
> event.
> That is because instructions must retire in program order, so [1,0] has to
> wait
> for [0, 2] to be retired first (i.e it has to wait unti cycle 10).
>
> In the dot-product example, all instructions are in a RAW (Read After
> Write)
> dependency chain.  Register %xmm2 written by the vmulps is immediately
> used by
> the first vhaddps, and register %xmm3 written by the first vhaddps is used
> by
> the second vhaddps.  Long data dependencies negatively affect the ILP
> (Instruction Level Parallelism).
>
> In the dot-product example, there are anti-dependencies introduced by
> instructions from different iterations.  However, those dependencies can be
> removed at register renaming stage (at the cost of allocating register
> aliases,
> and therefore consuming temporary registers).
>
> Table "Average Wait times" helps diagnose performance issues that
are
> caused by
> the presence of long latency instructions and potentially long data
> dependencies
> which may limit the ILP.  Note that the tool by default assumes at least
> 1cy
> between the dispatch event and the issue event.
>
> When the performance is limited by data dependencies and/or long latency
> instructions, the number of cycles spent while in the "ready"
state is
> expected
> to be very small when compared with the total number of cycles spent in the
> scheduler's queue.  So the difference between the two counters is a
good
> indicator of how big of an impact data dependencies had on the execution of
> instructions.  When performance is mostly limited by the lack of hardware
> resources, the delta between the two counters is small.  However, the
> number of
> cycles spent in the queue tends to be bigger (i.e. more than 1-3cy)
> especially
> when compared with other low latency instructions.
>
> Extra statistics to further diagnose performance issues.
> --------------------------------------------------------
>
> Flag '-verbose' enables extra statistics and performance counters
for the
> dispatch logic, the reorder buffer, the retire control unit and the
> register
> file.
>
> Below is an example of verbose output generated by the tool for the
> dot-product
> example discussed in the previous sections.
>
> ///////////////////
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Dynamic Dispatch Stall Cycles:
> RAT     - Register unavailable:                      0
> RCU     - Retire tokens unavailable:                 0
> SCHEDQ  - Scheduler full:                            272
> LQ      - Load queue full:                           0
> SQ      - Store queue full:                          0
> GROUP   - Static restrictions on the dispatch group: 0
>
>
> Register Alias Table:
> Total number of mappings created: 900
> Max number of mappings used:      35
>
>
> Dispatch Logic - number of cycles where we saw N instructions dispatched:
> [# dispatched], [# cycles]
>  0,              24  (3.9%)
>  1,              272  (44.6%)
>  2,              314  (51.5%)
>
>
> Schedulers - number of cycles where we saw N instructions issued:
> [# issued], [# cycles]
>  0,          7  (1.1%)
>  1,          306  (50.2%)
>  2,          297  (48.7%)
>
>
> Retire Control Unit - number of cycles where we saw N instructions retired:
> [# retired], [# cycles]
>  0,           109  (17.9%)
>  1,           102  (16.7%)
>  2,           399  (65.4%)
>
>
> Scheduler's queue usage:
> JALU01,  0/20
> JFPU01,  18/18
> JLSAGU,  0/12
> ///////////////////
>
> Based on the verbose report, the backend was only able to dispatch two
> instructions 51.5% of the time.  The dispatch group was limited to one
> instruction 44.6% of the cycles, which corresponds to 272 cycles.
>
> If we look at section "Dynamic Dispatch Stall Cycles", we can see
how
> counter
> SCHEDQ reports 272 cycles.  Counter SCHEDQ is incremented every time the
> dispatch
> logic is unable to dispatch a full group of two instructions because the
> scheduler's queue is full.
>
> Section "Scheduler's queue usage" shows how the maximum
number of buffer
> entries
> (i.e. scheduler's queue entries) used at runtime for resource JFPU01
> reached its
> maximum. Note that AMD Jaguar implements three schedulers:
>   * JALU01 - A scheduler for ALU instructions
>   * JLSAGU - A scheduler for address generation
>   * JFPU01 - A scheduler floating point operations.
>
> The dot-product is a kernel of three floating point instructions (a vector
> multiply followed by two horizontal adds).  That explains why only the
> floating
> point scheduler appears to be used according to section
"Scheduler's queue
> usage".
>
> A full scheduler's queue is either caused by data dependency chains, or
by
> a
> sub-optimal usage of hardware resources.  Sometimes, resource pressure can
> be
> mitigated by rewriting the kernel using different instructions that consume
> different scheduler resources.  Schedulers with a small queue are less
> resilient
> to bottlenecks caused by the presence of long data dependencies.
>
> In this example, we can conclude that the IPC is mostly limited by data
> dependencies, and not by resource pressure.
>
> LLVM-MCA instruction flow
> -------------------------
>
> This section describes the instruction flow through the out-of-order
> backend, as
> well as the functional units involved in the process.
>
> An instruction goes through a default sequence of stages:
>     - Dispatch (Instruction is dispatched to the schedulers).
>     - Issue (Instruction is issued to the processor pipelines).
>     - Write Back (Instruction is executed, and results are written back).
>     - Retire (Instruction is retired; writes are architecturally
> committed).
>
> The tool only models the out-of-order portion of a processor. Therefore,
> the
> instruction fetch and decode stages are not modeled. Performance
> bottlenecks in
> the frontend are not diagnosed by this tool.  The tool assumes that
> instructions
> have all been decoded and placed in a queue.  Also, the tool doesn't
know
> anything about branch prediction.
>
> Instruction Dispatch
> --------------------
>
> During the Dispatch stage, instructions are picked in program order from a
> queue
> of already decoded instructions, and dispatched in groups to the hardware
> schedulers.  The dispatch logic is implemented by class DispatchUnit in
> file
> Dispatch.h.
>
> The size of a dispatch group depends on the availability of hardware
> resources,
> and it cannot exceed the value of field 'DispatchWidth' in class
> DispatchUnit.
> Note that field DispatchWidth defaults to the value of field
'IssueWidth'
> from
> the scheduling model.
>
> Users can override the DispatchWidth value with flag
"-dispatch=<N>"
> (where 'N'
> is an unsigned quantity).
>
> An instruction can be dispatched if:
>  - The size of the dispatch group is smaller than DispatchWidth
>  - There are enough entries in the reorder buffer
>  - There are enough temporary registers to do register renaming
>  - Schedulers are not full.
>
> Scheduling models don't describe register files, and therefore the tool
> doesn't
> know if there is more than one register file, and how many temporaries are
> available for register renaming.
>
> By default, the tool (optimistically) assumes a single register file with
> an
> unbounded number of temporary registers.  Users can limit the number of
> temporary registers available for register renaming using flag
> `-register-file-size=<N>`, where N is the number of temporaries.  A
value
> of
> zero for N means 'unbounded'.  Knowing how many temporaries are
available
> for
> register renaming, the tool can predict dispatch stalls caused by the lack
> of
> temporaries.
>
> The number of reorder buffer entries consumed by an instruction depends on
> the
> number of micro-opcodes it specifies in the target scheduling model (see
> field
> 'NumMicroOpcodes' of tablegen class ProcWriteResources and its
derived
> classes;
> TargetSchedule.td).
>
> The reorder buffer is implemented by class RetireControlUnit (see
> Dispatch.h).
> Its goal is to track the progress of instructions that are
"in-flight", and
> retire instructions in program order.  The number of entries in the reorder
> buffer defaults to the value of field 'MicroOpBufferSize' from the
target
> scheduling model.
>
> Instructions that are dispatched to the schedulers consume scheduler buffer
> entries.  The tool queries the scheduling model to figure out the set of
> buffered resources consumed by an instruction.  Buffered resources are
> treated
> like "scheduler" resources, and the field 'BufferSize'
(from the processor
> resource tablegen definition) defines the size of the scheduler's
queue.
>
> Zero latency instructions (for example NOP instructions) don't consume
> scheduler
> resources.  However, those instructions still reserve a number of slots in
> the
> reorder buffer.
>
> Instruction Issue
> -----------------
>
> As mentioned in the previous section, each scheduler resource implements a
> queue
> of instructions.  An instruction has to wait in the scheduler's queue
until
> input register operands become available.  Only at that point, does the
> instruction becomes eligible for execution and may be issued (potentially
> out-of-order) to a pipeline for execution.
>
> Instruction latencies can be computed by the tool with the help of the
> scheduling model; latency values are defined by the scheduling model
> through
> ProcWriteResources objects.
>
> Class Scheduler (see file Scheduler.h) knows how to emulate multiple
> processor
> schedulers.  A Scheduler is responsible for tracking data dependencies, and
> dynamically select which processor resources are consumed/used by
> instructions.
>
> Internally, the Scheduler class delegates the management of processor
> resource
> units and resource groups to the ResourceManager class.  ResourceManager
> is also
> responsible for selecting resource units that are effectively consumed by
> instructions.  For example, if an instruction consumes 1cy of a resource
> group,
> the ResourceManager object selects one of the available units from the
> group; by
> default, it uses a round-robin selector to guarantee that resource usage is
> uniformly distributed between all units of a group.
>
> Internally, class Scheduler implements three instruction queues:
>   - WaitQueue: a queue of instructions whose operands are not ready yet.
>   - ReadyQueue: a queue of instructions ready to execute.
>   - IssuedQueue: a queue of instructions executing.
>
> Depending on the operands availability, instructions that are dispatched
> to the
> Scheduler are either placed into the WaitQueue or into the ReadyQueue.
>
> Every cycle, class Scheduler checks if instructions can be moved from the
> WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
> issued to the underlying pipelines.  The algorithm prioritizes older
> instructions over younger instructions.
>
> Objects of class ResourceState (see Scheduler.h) describe processor
> resources.
> There is an instance of class ResourceState for each single processor
> resource
> specified by the scheduling model.  A ResourceState object for a processor
> resource with multiple units dynamically tracks the availability of every
> single
> unit.  For example, the ResourceState of a resource group tracks the
> availability of every resource in that group.  Internally, ResourceState
> implements a round-robin selector to dynamically pick the next unit to use
> from
> the group.
>
> Write-Back and Retire Stage
> ---------------------------
>
> Issued instructions are moved from the ReadyQueue to the IssuedQueue.
> There,
> instructions wait until they reach the write-back stage.  At that point,
> they
> get removed from the queue and the retire control unit is notified.
>
> On the event of "instruction executed", the retire control unit
flags the
> instruction as "ready to retire".
>
> Instruction are retired in program order; an "instruction
retired" event
> is sent
> to the register file which frees the temporary registers allocated for the
> instruction at register renaming stage.
>
> Load/Store Unit and Memory Consistency Model
> --------------------------------------------
>
> The tool attempts to emulate out-of-order execution of memory operations.
> Class
> LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues
> for
> speculative execution of loads and stores.
>
> Each load (or store) consumes an entry in the load (or store) queue.  The
> number
> of slots in the load/store queues is unknown by the tool, since there is no
> mention of it in the scheduling model.  In practice, users can specify flag
> `-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in the queue
> to be
> equal to exactly N (an unsigned value). If N is zero, then the tool
> assumes an
> unbounded queue (this is the default).
>
> LSUnit implements a relaxed consistency model for memory loads and stores.
> The
> rules are:
> 1) A younger load is allowed to pass an older load only if there is no
>    intervening store in between the two loads.
> 2) An younger store is not allowed to pass an older store.
> 3) A younger store is not allowed to pass an older load.
> 4) A younger load is allowed to pass an older store provided that the load
> does
>    not alias with the store.
>
> By default, this class conservatively (i.e. pessimistically) assumes that
> loads
> always may-alias store operations.  Essentially, this LSUnit doesn't
> perform any
> sort of alias analysis to rule out cases where loads and stores don't
> overlap
> with each other.  The downside of this approach however is that younger
> loads are
> never allowed to pass older stores.  To make it possible for a younger
> load to
> pass an older store, users can use the command line flag -noalias.  Under
> 'noalias', a younger load is always allowed to pass an older store.
>
> Note that, in the case of write-combining memory, rule 2. could be relaxed
> a bit
> to allow reordering of non-aliasing store operations.  That being said, at
> the
> moment, there is no way to further relax the memory model (flag -noalias
> is the
> only option).  Essentially, there is no option to specify a different
> memory
> type (for example: write-back, write-combining, write-through; etc.) and
> consequently to weaken or strengthen the memory model.
>
> Other limitations are:
>  * LSUnit doesn't know when store-to-load forwarding may occur.
>  * LSUnit doesn't know anything about the cache hierarchy and memory
types.
>  * LSUnit doesn't know how to identify serializing operations and
memory
> fences.
>
> No assumption is made on the store buffer size.  As mentioned before,
> LSUnit
> conservatively assumes a may-alias relation between loads and stores, and
> it
> doesn't attempt to identify cases where store-to-load forwarding would
> occur in
> practice.
>
> LSUnit doesn't attempt to predict whether a load or store hits or
misses
> the L1
> cache.  It only knows if an instruction "MayLoad" and/or
"MayStore".  For
> loads,
> the scheduling model provides an "optimistic" load-to-use latency
(which
> usually
> matches the load-to-use latency for when there is a hit in the L1D).
>
> Class MCInstrDesc in LLVM doesn't know about serializing operations,
nor
> memory-barrier like instructions.  LSUnit conservatively assumes that an
> instruction which has both 'MayLoad' and
'UnmodeledSideEffects' behaves
> like a
> "soft" load-barrier.  That means, it serializes loads without
forcing a
> flush of
> the load queue.  Similarly, instructions flagged with both
'MayStore' and
> 'UnmodeledSideEffects' are treated like store barriers.  A full
memory
> barrier
> is a 'MayLoad' and 'MayStore' instruction with
'UnmodeledSideEffects'.
> This is
> inaccurate, but it is the best that we can do at the moment with the
> current
> information available in LLVM.
>
> A load/store barrier consumes one entry of the load/store queue.  A
> load/store
> barrier enforces ordering of loads/stores.  A younger load cannot pass a
> load
> barrier.  Also, a younger store cannot pass a store barrier.  A younger
> load has
> to wait for the memory/load barrier to execute.  A load/store barrier is
> "executed" when it becomes the oldest entry in the load/store
queue(s).
> That
> also means, by construction, all the older loads/stores have been executed.
>
> In conclusion the full set of rules is:
>   1. A store may not pass a previous store.
>   2. A load may not pass a previous store unless flag 'NoAlias' is
set.
>   3. A load may pass a previous load.
>   4. A store may not pass a previous load (regardless of flag
'NoAlias').
>   5. A load has to wait until an older load barrier is fully executed.
>   6. A store has to wait until an older store barrier is fully executed.
>
> Known limitations
> -----------------
> Previous sections described cases where the tool is missing information to
> give
> an accurate report.  For example, the first sections of this document
> explained
> how the lack of knowledge about the processor negatively affects the
> performance
> analysis.  The lack of knowledge is often a consequence of how scheduling
> models
> are defined; as mentioned before, scheduling models intentionally don't
> describe
> processors in fine details.
>
> The accuracy of the performance analysis is also affected by assumptions
> made by
> the processor model used by the tool.
>
> Most recent Intel and AMD processors implement dedicated
> LoopBuffer/OpCache in
> the hardware frontend to speedup the throughput in the presence of tight
> loops.
> The presence of these buffers complicates the decoding logic, and requires
> knowledge on the branch predictor too.  Class 'SchedMachineModel'
in
> tablegen
> provides a field named 'LoopMicroOpBufferSize' which is used to
describe
> loop
> buffers.  However, the purpose of that field is to enable loop unrolling of
> tight loops; essentially, it affects the cost model used by pass
> loop-unroll.
>
> By design, the tool only cares about the out-of-order portion of a
> processor,
> and consequently doesn't try to predict the frontend throughput.
> Processors may
> implement complex decoding schemes; statically predicting the frontend
> throughput is in general beyond the scope of this tool.  For the same
> reasons,
> this tool intentionally doesn't model branch prediction.  That being
said,
> this
> tool could be definitely extended in future to also account for the
> hardware
> frontend when doing performance analysis.  This would inevitably require
> extra
> (extensive) processor knowledge related to all the available decoding
> paths in
> the hardware frontend.
>
> When computing the IPC, the tool assumes a zero-latency "perfect"
> fetch&decode
> stage; the full sequence of decoded instructions is immediately visible to
> the
> dispatch logic from the start.
>
> The tool doesn't know about simultaneous mutithreading.  According to
the
> tool,
> processor resources are not statically/dynamically partitioned.  Processor
> resources are fully available to the hardware thread executing the
> microbenchmark.
>
> The execution model implemented by this tool assumes that instructions are
> firstly dispatched in groups to hardware schedulers, and then issued to
> pipelines for execution.  The model assumes dynamic scheduling of
> instructions.
> Instructions are placed in a queue and potentially executed out-of-order
> (based
> on the operand availability). The dispatch stage is definitely distinct
> from the
> issue stage.
>
> This model doesn't correctly describe processors where the
dispatch/issue
> is a
> single stage. This is what happens for example in VLIW processors, where
> instructions are packaged and statically scheduled at compile time; it is
> up to
> the compiler to predict the latency of instructions and package issue
> groups
> accordingly. For such targets, there is no dynamic scheduling done by the
> hardware.
>
> Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted
> to
> support processors with a single dispatch/issue stage. The execution flow
> would
> require some changes in the way how existing components (i.e.
> DispatchUnit,
> Scheduler, etc.) interact. This can be a future development.
>
> The following sections describes other known limitations.  The goal is not
> to
> provide an extensive list of limitations; we want to report what we
> believe are
> the most important limitations, and suggest possible methods to overcome
> them.
>
> Load/Store barrier instructions and serializing operations
> ----------------------------------------------------------
> Section "Load/Store Unit and Memory Consistency Model" already
mentioned
> how
> LLVM doesn't know about serializing operations and memory barriers. 
Most
> of it
> boils down to the fact that class MCInstrDesc (intentionally) doesn't
> expose
> those properties.  Instead, both serializing operations and memory barriers
> "have side-effects" according to MCInstrDesc.  That is because,
at least
> for
> scheduling purposes, knowing that an instruction has unmodeled side
> effects is
> often enough to treat the instruction like a compiler scheduling barrier.
>
> A performance analysis tool could use the extra knowledge on barriers and
> serializing operations to generate a more accurate performance report.
> One way
> to improve this is by reserving a couple of bits in field 'Flags'
from
> class
> MCInstrDesc: one bit for barrier operations, and another bit to mark
> instructions as serializing operations.
>
> Lack of support for instruction itineraries
> -------------------------------------------
> The current version of the tool doesn't know how to process instruction
> itineraries.  This is probably one of the most important limitations,
> since it
> affects a few out-of-order processors in LLVM.
>
> As mentioned in section 'Instruction Issue', class Scheduler
delegates to
> an
> instance of class ResourceManager the handling of processor resources.
> ResourceManager is where most of the scheduling logic is implemented.
>
> Adding support for instruction itineraries requires that we teach
> ResourceManager how to handle functional units and instruction stages.
> This
> development can be a future extension, and it would probably require a few
> changes to the ResourceManager interface.
>
> Instructions that affect control flow are not correctly modeled
> ---------------------------------------------------------------
> Examples of instructions that affect the control flow are: return, indirect
> branches, calls, etc.  The tool doesn't try to predict/evaluate branch
> targets.
> In particular, the tool doesn't model any sort of branch prediction,
nor
> does it
> attempt to track changes to the program counter.  The tool always assumes
> that
> the input assembly sequence is the body of a microbenchmark (a simple loop
> executed for a number of iterations). The "next" instruction in
sequence is
> always the next instruction to dispatch.
>
> Call instructions default to an arbitrary high latency of 100cy. A warning
> is
> generated if the tool encounters a call instruction in the sequence.
> Return
> instructions are not evaluated, and therefore control flow is not affected.
> However, the tool still queries the processor scheduling model to obtain
> latency
> information for instructions that affect the control flow.
>
> Possible extensions to the scheduling model
> -------------------------------------------
> Section "Instruction Dispatch" explained how the tool doesn't
know about
> the
> register files, and temporaries available in each register file for
> register
> renaming purposes.
>
> The LLVM scheduling model could be extended to better describe register
> files.
> Ideally, scheduling model should be able to define:
>  - The size of each register file
>  - How many temporary registers are available for register renaming
>  - How register classes map to register files
>
> The scheduling model doesn't specify the retire throughput (i.e. how
many
> instructions can be retired every cycle).  Users can specify flag
> `-max-retire-per-cycle=<uint>` to limit how many instructions the
retire
> control
> unit can retire every cycle.  Ideally, every processor should be able to
> specify
> the retire throughput (for example, by adding an extra field to the
> scheduling
> model tablegen class).
>
> Known limitations on X86 processors
> -----------------------------------
>
> 1) Partial register updates versus full register updates.
>
> On x86-64, a 32-bit GPR write fully updates the super-register. Example:
>       add %edi %eax    ## eax += edi
>
> Here, register %eax aliases the lower half of 64-bit register %rax. On
> x86-64,
> register %rax is fully updated by the 'add' (the upper half of %rax
is
> zeroed).
> Essentially, it "kills" any previous definition of (the upper
half of)
> register
> %rax.
>
> On the other hand, 8/16 bit register writes only perform a so-called
> "partial
> register update". Example:
>       add %di, %ax     ## ax += di
>
> Here, register %eax is only partially updated. To be more specific, the
> lower
> half of %eax is set, and the upper half is left unchanged. There is also no
> change in the upper 48 bits of register %rax.
>
> To get accurate performance analysis, the tool has to know which
> instructions
> perform a partial register update, and which instructions fully update the
> destination's super-register.
>
> One way to expose this information is (again) via tablegen.  For example,
> we
> could add a flag in the tablegen instruction class to tag instructions that
> perform partial register updates. Something like this: 'bit
> hasPartialRegisterUpdate = 1'. However, this would force a `let
> hasPartialRegisterUpdate = 0` on several instruction definitions.
>
> Another approach is to have a MCSubtargetInfo hook similar to this:
>     virtual bool updatesSuperRegisters(unsigned short opcode) { return
> false; }
>
> Targets will be able to override this method if needed.  Again, this is
> just an
> idea. But the plan is to have this fixed as a future development.
>
> 2) Macro Op fusion.
>
> The tool doesn't know about macro-op fusion. On modern x86 processors,
a
> 'cmp/test' followed by a 'jmp' is fused into a single macro
operation.  The
> advantage is that the fused pair only consumes a single slot in the
> dispatch
> group.
>
> As a future development, the tool should be extended to address
> macro-fusion.
> Ideally, we could have LLVM generate a table enumerating all the opcode
> pairs
> that can be fused together. That table could be exposed to the tool via the
> MCSubtargetInfo interface.  This is just an idea; there may be better ways
> to
> implement this.
>
> 3) Intel processors: mixing legacy SSE with AVX instructions.
>
> On modern Intel processors with AVX, mixing legacy SSE code with AVX code
> negatively impacts the performance.  The tool is not aware of this issue,
> and
> the performance penalty is not accounted when doing the analysis. This is
> something that we would like to improve in future.
>
> 4) Zero-latency register moves and Zero-idioms.
>
> Most modern AMD/Intel processors know how to optimize out register-register
> moves and zero idioms at register renaming stage. The tool doesn't know
> about these patterns, and this may negatively impact the performance
> analysis.
>
> Known design problems
> ---------------------
> This section describes two design issues that are currently affecting the
> tool.
> The long term plan is to "fix" these issues.
>
> 1) Variant instructions not correctly modeled.
>
> The tool doesn't know how to analyze instructions with a
"variant"
> scheduling
> class descriptor. A variant scheduling class needs to be resolved
> dynamically.
> The "actual" scheduling class often depends on the subtarget, as
well as
> properties of the specific MachineInstr object.
>
> Unfortunately, the tool manipulates MCInst, and it doesn't know
anything
> about
> MachineInstr. As a consequence, the tool cannot use the existing machine
> subtarget hooks that are normally used to resolve the variant scheduling
> class.
> This is a major design issue which mostly affects ARM/AArch64 targets.  It
> mostly boils down to the fact that the existing scheduling framework was
> meant
> to work for MachineInstr.
>
> When the tool encounters a "variant" instruction, it assumes a
generic 1cy
> latency. However, the tool would not be able to tell which processor
> resources
> are effectively consumed by the variant instruction.
>
> 2) MCInst and MCInstrDesc.
>
> Performance analysis tools require data dependency information to correctly
> predict the runtime performance of the code. This tool must always be able
> to
> obtain the set of implicit/explicit register defs/uses for every
> instruction of
> the input assembly sequence.
>
> In the first section of this document, it was mentioned how the tool takes
> as
> input an assembly sequence. That sequence is parsed into a MCInst sequence
> with
> the help of assembly parsers available from the targets.
>
> A MCInst is a very low-level instruction representation. The tool can
> inspect
> the MCOperand sequence of an MCInst to identify register operands. However,
> there is no way to tell register operands that are definitions from
> register
> operands that are uses.
>
> In LLVM, class MCInstrDesc is used to fully describe target instructions
> and
> their operands. The opcode of a machine instruction (a MachineInstr
> object) can
> be used to query the instruction set through method `MCInstrInfo::get'
to
> obtain
> the associated MCInstrDesc object.
>
> However class MCInstrDesc describes properties and operands of MachineInstr
> objects. Essentially, MCInstrDesc is not meant to be used to describe
> MCInst
> objects.  To be more specific, MCInstrDesc objects are automatically
> generated
> via tablegen from the instruction set description in the target .td
> files.  For
> example, field `MCInstrDesc::NumDefs' is always equal to the
cardinality
> of the
> `(outs)` set from the tablegen instruction definition.
>
> By construction, register definitions always appear at the beginning of the
> MachineOperands list in MachineInstr. Basically, the (outs) are the first
> operands of a MachineInstr, and the (ins) will come after in the machine
> operand
> list. Knowing the number of register definitions is enough to identify
> all the register operands that are definitions.
>
> In a normal compilation process, MCInst objects are generated from
> MachineInstr
> objects through a lowering step. By default the lowering logic simply
> iterates
> over the machine operands of a MachineInstr, and converts/expands them into
> equivalent MCOperand objects.
>
> The default lowering strategy has the advantage of preserving all of the
> above mentioned assumptions on the machine operand sequence. That means,
> register
> definitions would still be at the beginning of the MCOperand sequence, and
> register uses would come after.
>
> Targets may still define custom lowering routines for specific opcodes.
> Some of
> these routines may lower operands in a way that potentially breaks (some
> of) the
> assumptions on the machine operand sequence which were valid for
> MachineInstr.
> Luckily, this is not the most common form of lowering done by the targets,
> and
> the vast majority of the MachineInstr are lowered based on the default
> strategy
> which preserves the original machine operand sequence.  This is especially
> true
> for x86, where the custom lowering logic always preserves the original
> (i.e.
> from the MachineInstr) operand sequence.
>
> This tool currently works under the strong (and potentially incorrect)
> assumption that register def/uses in a MCInst can always be identified by
> querying the machine instruction descriptor for the opcode. This
> assumption made
> it possible to develop this tool and get good numbers at least for the
> processors available in the x86 backend.
>
> That being said, the analysis is still potentially incorrect for other
> targets.
> So we plan (with the help of the community) to find a proper mechanism to
> map
> when possible MCOperand indices back to MachineOperand indices of the
> equivalent
> MachineInstr.  This would be equivalent to describing changes made by the
> lowering step which affected the operand sequence. For example, we could
> have an
> index for every register MCOperand (or -1, if the operand didn't exist
in
> the
> original MachineInstr). The mapping could look like this <0,1,3,2>. 
Here,
> MCOperand #2 was obtained from the lowering of MachineOperand #3. etc.
>
> This information could be automatically generated via tablegen for all the
> instructions whose custom lowering step breaks assumptions made by the
> tool on
> the register operand sequence (In general, these instructions should be the
> minority of a target's instruction set). Unfortunately, we don't
have that
> information now.  As a consequence, we assume that the number of explicit
> register definitions is the same number specified in MCInstrDesc.  We also
> assume that register definitions always come first in the operand sequence.
>
> In conclusion: these are for now the strong assumptions made by the tool:
>   * The number of explicit and implicit register definitions in a MCInst
>     matches the number of explicit and implicit definitions specified by
> the
>     MCInstrDesc object.
>   * Register uses always come after register definitions.
>   * If an opcode specifies an optional definition, then the optional
>     definition is always the last register operand in the sequence.
>
> Note that some of the information accessible from the MCInstrDesc is always
> valid for MCInst. For example: implicit register defs, implicit register
> uses
> and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties
still
> apply to
> MCInst. The tool knows about this, and uses that information during its
> analysis.
>
> What to do next
> ---------------
> The source code has been uploaded for review on phabricator at this link:
> https://reviews.llvm.org/D43951.
>
> The review covers two patches:
> A first (very small) patch that always enables the generation of processor
> resource names in the SubtargetEmitter. Currently, the processor resource
> names
> are only generated for debugging purposes, but are needed by the tool to
> generate user friendly reports, so we would like to always generate them.
> A second patch with the actual static analysis tool (in llvm/tools).
>
> Once these first two patches are committed, the plan is to keep working on
> the
> tool with the help of the community to address all of the limitations
> described
> by the previous sections, and find good solutions/fixes for the design
> issues
> described by section "Known design problems".
>
> We hope the community will find this tool useful like we have.
>
> Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell who
> really
> helped me a lot by suggesting improvements and testing the tool.
>
> Thanks for your time.
> -Andrea
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180305/f99a65a3/attachment-0001.html>

Chris Lattner via llvm-dev

2018-Mar-06 06:01 UTC

head link

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

> 
> At Sony we developed an LLVM based performance analysis tool named
llvm-mca. We
> currently use it internally to statically measure the performance of code,
and
> to help triage potential problems with target scheduling models.  We
decided to
> post this RFC because we are interested in the feedback from the community,
and
> we also believe that other people might be interested in a tool like this.
This is really really cool.  Thank you for building and contributing this!

-Chris

Apparently Analagous Threads

Search for more reasonably related threads

llvm dev - Mar 2018 - [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

Apparently Analagous Threads