Hello llvm-dev,
I'm currently integrating an experimental instruction scheduler into
LLVM and have come upon a point of confusion regarding WAW
dependencies.
I apologize in advance for the unnecessarily convoluted example, but
it's one solid case which fails on my scheduler and where I can get
LLVM to produce a graph that shows my question clearly.
I have a small piece of code like this which I'm compiling for x86-64 with
-O3:
--------------------------------------------------
unsigned long long x = 0;
for (long long i=0; i<500000; ++i) {
i += i;
x -= 123;
x *= i;
x += 456;
}
--------------------------------------------------
The optimizer removes the i += i; operation, hardcodes the fact that
there are 19 iterations, and counts down instead of up. This is irrelevant
to the problem, but makes the graph clearer.
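
For anyone who wants to check the trip count, here is a minimal sketch that
replays only the updates to i. The function name count_iterations is made up
for illustration, and the updates to x are left out on the assumption that
they do not influence the loop condition.

```python
def count_iterations(limit=500000):
    """Replay the induction variable of the loop and count the trips."""
    i = 0
    iterations = 0
    while i < limit:
        i += i           # loop body: i += i
        iterations += 1
        i += 1           # the for statement's ++i
    return iterations


print(count_iterations())  # prints 19
```

After n trips i equals 2^n - 1, so the loop stops once 2^n - 1 reaches
500000, which first happens at n = 19.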
The loop body produces the following SelectionDAG:
http://i.imgur.com/tmJBZ.png
The JNE_4 near the root of the graph depends on the flag produced by
the DEC64_32r, through a CopyToReg node. Other nodes that write to the
flags, such as the ADD64rr nodes on the left, are not linked to the
DEC64_32rr node with a WAW edge. I'm assuming that this is because
they are not hardcoded to output to EFLAGS, but rather to a virtual
register, and the aforementioned CopyToReg node takes the virtual
output from DEC64_32rr and puts it in EFLAGS.
The SelectionDAG is then converted to the following ScheduleDAG:
http://i.imgur.com/S1uWQ.png
Here we can see that the JNE_4 and its CopyToReg neighbor are merged
into a single SUnit and are linked to DEC64_32rr with a data
dependency and to TokenFactor with an order dependency. There's still
no indication that TokenFactor or any of the nodes above it may
overwrite the flags created by DEC64_32rr, again, presumably because
JNE_4's CopyToReg takes care of moving DEC64_32rr's flag output to
EFLAGS.
Now we come to the actual problem. As far as the graph is concerned,
there is nothing preventing us from scheduling SU2 (DEC64_32rr) before
any of the other flag-affecting nodes, such as SU10. One such schedule
is 8-13-12-14-3-11-2-7-10-5-6-9-4-1-0 (which is, incidentally, what my
scheduler produces). However, when we look at the code generated from
such a schedule, we see this:
--------------------------------------------------
.LBB0_1: # %bb
# =>This Inner Loop Header: Depth=1
addq $-123, %rdx
leaq (%rax,%rax), %rsi
imulq %rdx, %rsi
decl %ecx
leaq 1(%rax,%rax), %rax
addq $456, %rsi # imm = 0x1C8
movq %rsi, %rdx
jne .LBB0_1
--------------------------------------------------
In this case, the flags set by "decl %ecx" are overwritten by the ones
produced by "addq $456, %rsi", which causes "jne .LBB0_1" to use the
wrong value. The CopyToReg that was supposed to deliver decl's flags
output to jne is nowhere to be seen. I understand that it's usually a
conceptual rather than physical instruction and is optimized out, but
given the above schedule, it is necessary for code correctness.
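
To make the clobber concrete, here is a rough sketch that walks the emitted
instructions in order and reports which instruction last wrote the flags
before each flag reader. This is not how LLVM models the dependency; the
table and the helper flag_sources are hypothetical, and the writes/reads
columns are my assumption from x86 semantics (add, imul and dec modify
EFLAGS, lea and mov do not, jne reads ZF).

```python
# Each entry: (instruction text, writes_flags, reads_flags).
# The flag columns are assumptions of this sketch, not LLVM output.
emitted = [
    ("addq $-123, %rdx", True, False),
    ("leaq (%rax,%rax), %rsi", False, False),
    ("imulq %rdx, %rsi", True, False),
    ("decl %ecx", True, False),
    ("leaq 1(%rax,%rax), %rax", False, False),
    ("addq $456, %rsi", True, False),
    ("movq %rsi, %rdx", False, False),
    ("jne .LBB0_1", False, True),
]


def flag_sources(instrs):
    """Map each flag-reading instruction to the last flag writer before it."""
    last_writer = None
    sources = {}
    for text, writes, reads in instrs:
        if reads:
            sources[text] = last_writer
        if writes:
            last_writer = text
    return sources


sources = flag_sources(emitted)
print(sources["jne .LBB0_1"])  # prints: addq $456, %rsi
```

The jne ends up consuming the flags of the addq rather than those of the
decl, which is exactly the ordering constraint I cannot find in the graph.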
Compiling the same piece of code using the list-burr or fast
schedulers produces valid code, so they must be able to recognize the
hidden dependency. However, I can't see how they are doing it. Can
someone please shed some light on this issue?
Thanks!
Regards,
Max