Nagurne, James via llvm-dev
2021-Feb-12 20:40 UTC
[llvm-dev] MachinePipeliner: Representing operand latencies around the backedge
One fundamental calculation in software pipelining is the recurrence minimum initiation interval (RecMII). The LLVM MachinePipeliner pass augments the DAG with extra edges to assist in this calculation in the function "updatePhiDependences", which notes that it must modify the DAG because "ScheduleDAGInstrs no longer processes dependences for PHIs".

In this function, we find instructions in the loop header that def a register which is used by a PHI at the top of the loop header. Such a register is live across the backedge of the loop. When this happens, the function adds an anti-edge from the PHI to the defining instruction with latency 1. The function does more than this, but I will set that aside, since the anti-edge is the topic of my discussion/question.

Consider an instruction that writes its result register in 2 cycles. The minimum number of cycles that must elapse before that value can be used is therefore 2. Now place this instruction near the bottom of a loop that is a candidate for software pipelining:

Header:
  %0 = PHI %1, Preheader, %2, Header
  ...
  %2 = two_cycle_op %0
  ...
  ; terminators

With the calculation as written today, there will be an anti-edge between the PHI and two_cycle_op with latency 1, leading to a RecMII of 1.

Now, unroll this loop once:

Header:
  %0 = PHI %1, Preheader, %3, Header
  ...
  %2 = two_cycle_op %0
  ...
  %3 = two_cycle_op %2
  ...
  ; terminators

Since the latency between the def of %2 and its use is 2, and the anti-edge between the PHI and the def of %3 has latency 1, we get a RecMII of 3 (the sum of the latencies around the recurrence, 2 + 1, over a distance of 1). This shows that the algorithm treats a dependence carried across the backedge differently from the same dependence in straight-line code: the def-use latency counts for 2 cycles within an iteration but only 1 cycle across the backedge.

I have trialed a potentially upstreamable solution that utilizes computeOperandLatency and works like the following. Consider this loop, where two iterations have been linearized together for clarity:

loop_header:
    ; Iteration n
    | %0 = PHI %1, outside_loop, %2, loop_header
  1 | %2 = do_something %0
    ; Iteration n+1
  2 | %0 = PHI %1, outside_loop, %2, loop_header
  3 | %2 = do_something %0

- %2 of instruction 1 is the original def. It is defined in iteration n and used in iteration n+1.
- %0 of instruction 2 is the phi def. It is the representation of the original def coming from a previous iteration.
- %0 of instruction 3 is the true use of the original def. We want to get the latency between the original def and this operand. If there are multiple true (non-PHI) uses of the original def, take the maximum.

If anyone has history or comments to add on the original approach, or would like to talk more about my approach or about upstreaming the change in a review of some form, please let me know.

J.B. Nagurne
Code Generation
Texas Instruments
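P.S. For concreteness, here is a minimal sketch of the latency query I have in mind. The helper name and the surrounding plumbing are illustrative only, not the existing MachinePipeliner code; it just shows how a TargetSchedModel could answer "what is the maximum latency from the original def to the true uses of the phi def", which would replace the constant 1 on the anti-edge.

  #include "llvm/CodeGen/MachineInstr.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"
  #include "llvm/CodeGen/TargetSchedule.h"
  #include <algorithm>
  using namespace llvm;

  // Sketch only (assumed helper name): given the original def and the PHI
  // that carries its value across the backedge, return the latency to put
  // on the Phi->Def anti-edge instead of the constant 1.
  static unsigned backedgeAntiEdgeLatency(const TargetSchedModel &SchedModel,
                                          const MachineRegisterInfo &MRI,
                                          const MachineInstr &OrigDef, // 1 | %2 = do_something %0
                                          unsigned DefOpIdx,           // index of %2 in OrigDef
                                          const MachineInstr &Phi) {   // 2 | %0 = PHI ..., %2, ...
    unsigned Latency = 1; // Never go below the current behavior.
    Register PhiReg = Phi.getOperand(0).getReg();
    // Every non-PHI use of the phi def is a true use of the original def
    // from the previous iteration; take the maximum latency over them.
    for (const MachineOperand &UseMO : MRI.use_operands(PhiReg)) {
      const MachineInstr &UseMI = *UseMO.getParent();
      if (UseMI.isPHI())
        continue;
      unsigned UseOpIdx = UseMI.getOperandNo(&UseMO);
      Latency = std::max(Latency,
                         SchedModel.computeOperandLatency(&OrigDef, DefOpIdx,
                                                          &UseMI, UseOpIdx));
    }
    return Latency;
  }

On the two_cycle_op example above, this would report 2 rather than 1, so the single-iteration loop and its once-unrolled form would agree on the per-iteration cost of the recurrence.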