Nagurne, James via llvm-dev
2021-Feb-12 20:40 UTC
[llvm-dev] MachinePipeliner: Representing operand latencies around the backedge
One fundamental calculation in software pipelining is the recurrence-constrained
minimum initiation interval (RecMII). The LLVM MachinePipeliner pass augments
the DAG with extra edges to assist in this calculation in the function
"updatePhiDependences", which notes that it must modify the DAG because
"ScheduleDAGInstrs no longer processes dependences for PHIs".
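For reference, the usual recurrence bound in the modulo scheduling literature is

  RecMII = max over all dependence cycles C of ceil(Latency(C) / Distance(C))

where Latency(C) sums the edge latencies around the cycle and Distance(C) sums
the iteration distances of its edges. The examples below apply this formula.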
In this function, we find instructions in the loop header that def a register
which is used by a PHI at the top of the loop header. This indicates that the
register is live across the backedge of the loop. When this happens, the
function adds an anti-edge from the PHI to the instruction with latency 1. The
function does more than this, but the anti-edge is the topic of my
discussion/question.
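In ScheduleDAG terms, the edge described above amounts to roughly the
following (a simplified sketch based on the description above, not the exact
upstream code; PhiSU, DefSU and Reg are placeholder names for the PHI's SUnit,
the defining instruction's SUnit and the live register):

  // Sketch only: an anti dependence carrying the register, with a fixed
  // latency of 1, from the PHI to the in-loop def of the register.
  SDep Dep(PhiSU, SDep::Anti, Reg); // PhiSU/Reg: placeholders, see above
  Dep.setLatency(1);                // fixed 1-cycle latency across the backedge
  DefSU->addPred(Dep);              // DefSU: SUnit of the instruction defining Reg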
Consider an instruction that writes a register in 2 cycles. The minimum number
of cycles that must elapse before that value can be used is therefore 2.
Now put this instruction near the bottom of a loop that is a candidate for
software pipelining.
Header:
  %0 = PHI %1, Preheader, %2, Header
  ...
  %2 = two_cycle_op %0
  ... ; terminators
Given the calculation as written today, there will be an anti-edge of latency 1
between the PHI and two_cycle_op, leading to a RecMII of 1.
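Working that through with the formula above: the only recurrence cycle is
PHI -> two_cycle_op -> PHI, with the PHI contributing no latency and the
anti-edge contributing 1, over a distance of 1, so RecMII = ceil((0 + 1) / 1)
= 1, even though the value genuinely needs 2 cycles before it can be consumed
in the next iteration.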
Now, unroll this loop once:
Header:
  %0 = PHI %1, Preheader, %3, Header
  ...
  %2 = two_cycle_op %0
  ...
  %3 = two_cycle_op %2
  ... ; terminators
Since the latency between the def of %2 and its use is 2, and the anti-edge
between the PHI and def of %3 is 1, we get a RecMII of 3.
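In terms of the formula: the cycle is now PHI -> %2 = two_cycle_op ->
%3 = two_cycle_op -> PHI, with latencies 0 + 2 + 1 = 3 over a distance of 1,
so RecMII = ceil(3 / 1) = 3. If the backedge edge instead carried the true
2-cycle latency, the original loop would get RecMII = 2 and the unrolled loop
RecMII = 4, i.e. a consistent 2 cycles per original iteration in both cases.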
This shows that the algorithm treats latency across the backedge differently
from how the same def-use pair would be treated in straight-line code.
I have prototyped a potentially upstreamable solution that uses
computeOperandLatency and works as follows:
Consider the following loop where two iterations have been linearized together
for clarity:
loop_header:
  ; Iteration n
    | %0 = PHI %1, outside_loop, %2, loop_header
  1 | %2 = do_something %0
  ; Iteration n+1
  2 | %0 = PHI %1, outside_loop, %2, loop_header
  3 | %2 = do_something %0
%2 of instruction 1 is the original def. It is defined in iteration n and used
in iteration n+1.
%0 of instruction 2 is the PHI def. It represents the original def arriving
from the previous iteration.
%0 of instruction 3 is the true use of the original def. We want the latency
between the original def and this operand.
If there are multiple true (non-PHI) uses of the original def, take the maximum
latency; a sketch of the idea follows below.
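In code, that could look roughly like the sketch below. This is only an
illustration of the idea, not the patch itself; it assumes access to the
pass's TargetSchedModel and MachineRegisterInfo, and the helper name
backedgeLatency is made up for the example. It uses
TargetSchedModel::computeOperandLatency, so details may differ in the actual
change:

  #include "llvm/CodeGen/MachineInstr.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"
  #include "llvm/CodeGen/TargetSchedule.h"
  #include <algorithm>

  using namespace llvm;

  // Hypothetical helper: latency to use on the backedge anti-edge for a value
  // defined by OrigDefMI in register OrigReg and carried into the next
  // iteration by PhiMI.
  static unsigned backedgeLatency(const TargetSchedModel &SchedModel,
                                  const MachineRegisterInfo &MRI,
                                  MachineInstr &OrigDefMI, Register OrigReg,
                                  MachineInstr &PhiMI) {
    // The PHI def stands in for the original value in the next iteration.
    Register PhiReg = PhiMI.getOperand(0).getReg();
    unsigned Latency = 1; // fall back to today's fixed anti-edge latency
    int DefIdx = OrigDefMI.findRegisterDefOperandIdx(OrigReg);
    if (DefIdx < 0)
      return Latency;
    // Walk the true (non-PHI) uses of the PHI def; each one is a use of the
    // original def from the previous iteration.
    for (MachineInstr &UseMI : MRI.use_instructions(PhiReg)) {
      if (UseMI.isPHI())
        continue;
      int UseIdx = UseMI.findRegisterUseOperandIdx(PhiReg);
      if (UseIdx < 0)
        continue;
      // Latency from the original def to this cross-iteration use.
      Latency = std::max(Latency, SchedModel.computeOperandLatency(
                                      &OrigDefMI, DefIdx, &UseMI, UseIdx));
    }
    return Latency;
  }

The maximum over all such uses would then replace the fixed latency of 1 on
the anti-edge added in updatePhiDependences.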
If anyone has history or comments to add on the original approach, or would
like to discuss my approach or upstreaming the change in a review of some form,
please let me know.
J.B. Nagurne
Code Generation
Texas Instruments