I have some IR that looks like this right before llc -O3 invokes LSR:
loop_header: ; preds = %loop_header.preheader1, %in_loop_2
%var_13_ = phi double [ %183, %in_loop_2 ], [ 0.000000e+00, %loop_header.preheader1 ]
%var_17_ = phi i32 [ %184, %in_loop_2 ], [ %153, %loop_header.preheader1 ]
%171 = zext i32 %var_17_ to i64
%172 = icmp ult i32 %var_17_, %length.i820
br i1 %172, label %in_loop, label %exit
in_loop:
....
br i1 %should_stay_in_loop, label %in_loop_2, label %exit
in_loop_2:
%180 = getelementptr inbounds double, double addrspace(1)* %26, i64 %171
%181 = load double, double addrspace(1)* %180, align 8
%182 = fmul double %178, %181
%183 = fadd double %var_13_, %182
%184 = add nsw i32 %var_17_, 1
%185 = icmp slt i32 %184, %157
br i1 %185, label %loop_header, label %outside.loopexit
If SCEV is unable to prove that %var_17_ is nuw, then all is well, and
the code generated for the BB %in_loop_2 looks like this:
mulsd 16(%rbx,%rcx,8), %xmm1
addsd %xmm1, %xmm0
leal 1(%rcx), %ebx
cmpl %r11d, %ebx
jl .LBB0_69
which is great and pretty much what I'd expect.
However, if SCEV can prove that %var_17_ is nuw (which it can via
%172, after http://reviews.llvm.org/rL233829), LSR rewrites %in_loop_2
to:
in_loop_2:
%scevgep12 = getelementptr double, double addrspace(1)* %scevgep1011, i64 %lsr.iv5
%177 = load double, double addrspace(1)* %scevgep12, align 8, !tbaa !24
%178 = fmul double %175, %177
%179 = fadd double %var_13_, %178
%lsr.iv.next6 = add nuw nsw i64 %lsr.iv5, 1
%180 = add i64 %153, %lsr.iv.next6
%tmp = trunc i64 %180 to i32
%181 = icmp slt i32 %tmp, %151
br i1 %181, label %not_zero146, label %bci_81.loopexit
where
%lsr.iv5 = phi i64 [ %lsr.iv.next6, %in_bounds161 ], [ 0, %not_zero146.preheader1 ]
This is a regression -- the IR itself is more complicated and the
generated machine code for %in_loop_2 is:
mulsd (%r13,%rdi,8), %xmm1
addsd %xmm1, %xmm0
incq %rdi
movl %edi, %ecx
addl %r14d, %ecx
movq %r8, %rbx
cmpl %ebx, %ecx
jl .LBB0_69
which has three more instructions per iteration of the loop (one of
which is an add) than the first assembly listing.
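For reference, here is a rough standalone C++ model of the two loop
shapes, showing that they visit the same array indices. It is only a
sketch: the function names, the parameter names (start, length, limit)
and the form of the header guard in the rewritten loop are assumptions
for illustration, since the rewritten header is not shown above.

```cpp
#include <cstdint>
#include <vector>

// Original shape: a 32-bit induction variable, zero-extended at its use.
// Loosely mirrors %var_17_, %171 (zext), %172 (ult guard), %185 (slt test).
std::vector<uint64_t> original_loop(uint32_t start, uint32_t length,
                                    int32_t limit) {
    std::vector<uint64_t> indices;
    uint32_t iv = start;
    for (;;) {
        uint64_t idx = iv;                    // zext i32 -> i64
        if (!(iv < length)) break;            // unsigned guard in the header
        indices.push_back(idx);               // index used by the load
        uint32_t next = iv + 1;               // 32-bit increment
        if (!(static_cast<int32_t>(next) < limit)) break;  // signed exit test
        iv = next;
    }
    return indices;
}

// Rewritten shape: a 64-bit counter starting at 0, with the start value
// added back in and truncated for the 32-bit exit test.
// Assumption: the header guard is modelled on the truncated sum.
std::vector<uint64_t> lsr_loop(uint32_t start, uint32_t length,
                               int32_t limit) {
    std::vector<uint64_t> indices;
    uint64_t lsr_iv = 0;                      // models %lsr.iv5
    for (;;) {
        uint32_t iv32 = static_cast<uint32_t>(start + lsr_iv);
        if (!(iv32 < length)) break;          // assumed guard form
        indices.push_back(static_cast<uint64_t>(start) + lsr_iv);
        uint64_t lsr_next = lsr_iv + 1;       // models %lsr.iv.next6
        uint64_t sum = static_cast<uint64_t>(start) + lsr_next;  // extra add
        int32_t tmp = static_cast<int32_t>(static_cast<uint32_t>(sum));  // trunc
        if (!(tmp < limit)) break;            // signed exit test
        lsr_iv = lsr_next;
    }
    return indices;
}
```

Calling both with, say, start = 3, length = 10, limit = 8 should yield
the same index sequence; the second form just pays for the extra add
and truncate on every iteration.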
As far as I can tell, the key issue here is that LSR does not consider
formulae of the form "(zext T)" for a use -- there is no
LSRInstance::GenerateZexts (or LSRInstance::GenerateSexts, for that
matter). Adding an LSRInstance::GenerateZexts modeled after
LSRInstance::GenerateTruncates seems to fix the issue. Does this
make sense or am I missing something?
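
As I understand it, the nuw fact matters because it lets a zext be moved
across the recurrence: zero-extending each 32-bit value gives the same
result as running the recurrence in 64 bits. A small standalone C++
check of that identity follows. It is only an illustration, and
zext_commutes is a made-up helper, not anything in LLVM.

```cpp
#include <cstdint>

// Returns true if zext(start + i) == zext(start) + i for every i in
// [0, trips), i.e. if the 32-bit recurrence {start,+,1} never wraps
// in the unsigned sense over those iterations.
bool zext_commutes(uint32_t start, uint64_t trips) {
    for (uint64_t i = 0; i < trips; ++i) {
        // Add in 32 bits (wraps modulo 2^32), then zero-extend.
        uint64_t narrow_then_zext =
            static_cast<uint32_t>(start + static_cast<uint32_t>(i));
        // Zero-extend first, then add in 64 bits.
        uint64_t zext_then_wide = static_cast<uint64_t>(start) + i;
        if (narrow_then_zext != zext_then_wide) return false;
    }
    return true;
}
```

For a start of 3 and ten iterations the two agree, while a start of
0xFFFFFFFE diverges on the third iteration, which is the case the nuw
proof rules out.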
-- Sanjoy