thr3ads.net - llvm dev - [llvm-dev] LoopStrengthReduction generates false code [Jun 2020]

Boris Boesler via llvm-dev

2020-Jun-09 18:59 UTC

[llvm-dev] LoopStrengthReduction generates false code

Hm, no. I expect byte addresses everywhere. The compiler should not know that
the arch needs word addresses. During lowering, LOAD and STORE get explicit
conversion operations for the memory address. Even if my arch were byte
addressed, the code would be wrong/illegal.

Boris
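
As a rough illustration of the scheme described above (byte addresses throughout the IR, converted to word addresses only when loads and stores are lowered), here is a minimal sketch. It assumes 8-byte words, based on the "factor 8" mentioned in the thread; the helper name `byte_to_word_address` is hypothetical and is not part of LLVM or of the backend under discussion.

```python
WORD_SIZE_BYTES = 8  # assumption: one addressable word is 64 bits


def byte_to_word_address(byte_addr: int) -> int:
    """Convert a byte address to a word address (hypothetical lowering step)."""
    if byte_addr % WORD_SIZE_BYTES != 0:
        raise ValueError("byte address is not word aligned")
    return byte_addr // WORD_SIZE_BYTES


# Byte addresses of elements 0..3 of an array placed at byte address 0x100,
# with each element occupying one 8-byte word.
base = 0x100
byte_addrs = [base + 8 * i for i in range(4)]
word_addrs = [byte_to_word_address(a) for a in byte_addrs]
print(word_addrs)
```

Under this assumption, consecutive elements that are 8 apart in byte addresses end up 1 apart in word addresses.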
> On 09.06.2020 at 19:36, Eli Friedman <efriedma at quicinc.com> wrote:
> 
> Blindly guessing here: "memory is not byte addressed", but you never fixed
> ScalarEvolution to handle that, so it's modeling the GEP in a way you're not
> expecting.
> 
> -Eli
> 
>> -----Original Message-----
>> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Boris Boesler
>> via llvm-dev
>> Sent: Tuesday, June 9, 2020 1:17 AM
>> To: llvm-dev at lists.llvm.org
>> Subject: [EXT] [llvm-dev] LoopStrengthReduction generates false code
>> 
>> Hi.
>> 
>> In my backend I get incorrect code after running LoopStrengthReduction. In the
>> generated code the loop index variable is multiplied by 8 (correct, everything
>> is 64 bit aligned) to get an address offset, and the index variable is
>> incremented by 1*8, which is not correct. It should be incremented by 1
>> only. The factor 8 appears again.
>> 
>> I compared the debug output (-debug-only=scalar-evolution,loop-reduce) for
>> my backend and the ARM backend, but simply can't read/understand it.
>> They differ in factors 4 vs 8 (ok), but there are more differences, probably
>> caused by the implementation of TargetTransformInfo for ARM, while I
>> haven't implemented it for my arch yet.
>> 
>> How can I debug this further? In my arch everything is 64 bit aligned (factor 8
>> in many passes) and the memory is not byte addressed.
>> 
>> Thanks,
>> Boris
>> 
>> ----8<----
>> 
>> LLVM assembly:
>> 
>> @buffer = common dso_local global [10 x i32] zeroinitializer, align 4
>> 
>> ; Function Attrs: nounwind
>> define dso_local void @some_main(i32* %result) local_unnamed_addr #0 {
>> entry:
>>  tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]* @buffer, i32 0, i32 0)) #2
>>  br label %while.body
>> 
>> while.body:                                       ; preds = %entry, %while.body
>>  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
>>  %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
>>  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
>>  %cmp1 = icmp ne i32 %0, -559038737
>>  %inc = add nuw nsw i32 %i.010, 1
>>  %cmp11 = icmp eq i32 %i.010, 0
>>  %cmp = or i1 %cmp11, %cmp1
>>  br i1 %cmp, label %while.body, label %while.end
>> 
>> while.end:                                        ; preds = %while.body
>>  %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
>>  %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
>>  store volatile i32 %1, i32* %result, align 4, !tbaa !2
>>  ret void
>> }
>> 
>> declare dso_local void @fill_array(i32*) local_unnamed_addr #1
>> 
>> attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="false"
>> "disable-tail-calls"="false" "less-precise-fpmad"="false"
>> "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
>> "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false"
>> "no-signed-zeros-fp-math"="false" "no-trapping-math"="false"
>> "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
>> "use-soft-float"="false" }
>> attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false"
>> "disable-tail-calls"="false" "less-precise-fpmad"="false"
>> "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
>> "no-infs-fp-math"="false" "no-nans-fp-math"="false"
>> "no-signed-zeros-fp-math"="false" "no-trapping-math"="false"
>> "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
>> "use-soft-float"="false" }
>> attributes #2 = { nounwind }
>> 
>> !llvm.module.flags = !{!0}
>> !llvm.ident = !{!1}
>> 
>> !0 = !{i32 1, !"wchar_size", i32 4}
>> !1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
>> !2 = !{!3, !3, i64 0}
>> !3 = !{!"int", !4, i64 0}
>> !4 = !{!"omnipotent char", !5, i64 0}
>> !5 = !{!"Simple C/C++ TBAA"}
>> 
>> 
>> (-debug-only=scalar-evolution,loop-reduce) for my arch:
>> 
>> LSR on loop %while.body:
>> Collecting IV Chains.
>> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
>> IV={@buffer,+,8}<nsw><%while.body>
>> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
>> IV={0,+,1}<nuw><nsw><%while.body>
>> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
>> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>> LSR has identified the following interesting factors and types: *8
>> LSR is examining the following fixup sites:
>>  UserInst=%cmp11, OperandValToReplace=%i.010
>>  UserInst=%0, OperandValToReplace=%arrayidx
>> LSR found 2 uses:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg({@buffer,+,8}<nsw><%while.body>)
>> 
>> After generating reuse formulae:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,8}<nuw><nsw><%while.body>)
>>    reg({0,+,1}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg({@buffer,+,8}<nsw><%while.body>)
>>    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>  Filtering out formula reg({0,+,1}<nuw><nsw><%while.body>)
>>    in favor of formula reg({0,+,-1}<nw><%while.body>)
>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>> 
>> After filtering out undesirable candidates:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,8}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg({@buffer,+,8}<nsw><%while.body>)
>>    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>> New best at 2 instructions 2 regs, with addrec cost 2.
>> Regs: {0,+,-1}<nw><%while.body> {@buffer,+,8}<nsw><%while.body>
>> New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
>> Regs: {0,+,8}<nuw><nsw><%while.body> @buffer
>> 
>> The chosen solution requires 2 instructions 2 regs, with addrec cost 1, plus 1 base add:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,8}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>> 
>> 
>> (-debug-only=scalar-evolution,loop-reduce) for ARM:
>> 
>> LSR on loop %while.body:
>> Collecting IV Chains.
>> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
>> IV={@buffer,+,4}<nsw><%while.body>
>> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
>> IV={0,+,1}<nuw><nsw><%while.body>
>> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
>> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>> LSR has identified the following interesting factors and types: *4
>> LSR is examining the following fixup sites:
>>  UserInst=%cmp11, OperandValToReplace=%i.010
>>  UserInst=%0, OperandValToReplace=%arrayidx
>> LSR found 2 uses:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg({@buffer,+,4}<nsw><%while.body>)
>> 
>> After generating reuse formulae:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,4}<nuw><nsw><%while.body>)
>>    reg({0,+,1}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg({@buffer,+,4}<nsw><%while.body>)
>>    reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>>    -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>>    reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>>    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>    reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>  Filtering out formula -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>>    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>>  Filtering out formula reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>>    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>> 
>> After filtering out undesirable candidates:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,4}<nuw><nsw><%while.body>)
>>    reg({0,+,1}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg({@buffer,+,4}<nsw><%while.body>)
>>    reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>>    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>    reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>> New best at 1 instruction 2 regs, with addrec cost 1.
>> Regs: {0,+,-1}<nw><%while.body> @buffer
>> 
>> The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
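
For readers unfamiliar with the notation in the logs above: an add recurrence `{start,+,step}<%loop>` denotes a value that equals `start` on the first iteration of the loop and grows by `step` on each further iteration. The sketch below evaluates the formulae LSR chose for each target and checks them against the original loop. The buffer address `0x1000` is an arbitrary placeholder, `addrec` and `check` are hypothetical helpers, and integer wraparound is ignored.

```python
def addrec(start: int, step: int, n: int) -> int:
    """Value of the add recurrence {start,+,step} on iteration n (no wraparound)."""
    return start + step * n


BUFFER = 0x1000  # placeholder address for @buffer


def check(stride: int, chosen_addr, chosen_is_zero) -> bool:
    """Compare the chosen formulae with the original loop for 10 iterations."""
    for n in range(10):
        original_addr = addrec(BUFFER, stride, n)   # {@buffer,+,stride}
        original_is_zero = addrec(0, 1, n) == 0     # icmp eq i32 %i.010, 0
        if chosen_addr(n) != original_addr:
            return False
        if chosen_is_zero(n) != original_is_zero:
            return False
    return True


# Custom backend: a single register {0,+,8} serves both uses.
custom_ok = check(
    8,
    lambda n: BUFFER + 1 * addrec(0, 8, n),
    lambda n: addrec(0, 8, n) == 0,
)

# ARM: a single register {0,+,-1}, scaled by -4 inside the address.
arm_ok = check(
    4,
    lambda n: BUFFER + -4 * addrec(0, -1, n),
    lambda n: addrec(0, -1, n) == 0,
)

print(custom_ok, arm_ok)
```

In this simplified model both chosen solutions reproduce the original addresses and the original comparison; they differ only in which induction variable is kept in a register.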

Eli Friedman via llvm-dev

2020-Jun-09 19:56 UTC

[llvm-dev] LoopStrengthReduction generates false code

Hmm.  Then I'm not sure; at first glance, the debug output looks fine either
way.  Could you show the IR after LSR, and explain why it's wrong?

-Eli

Boris Boesler via llvm-dev

2020-Jun-10 08:58 UTC

[llvm-dev] LoopStrengthReduction generates false code

The IR after LSR is:

*** IR Dump After Loop Strength Reduction ***
; Preheader:
entry:
  tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]* @buffer, i32 0, i32 0)) #2
  br label %while.body

; Loop:
while.body:                                       ; preds = %while.body, %entry
  %lsr.iv = phi i32 [ %lsr.iv.next, %while.body ], [ 0, %entry ]
  %uglygep = getelementptr i8, i8* bitcast ([10 x i32]* @buffer to i8*), i32 %lsr.iv
  %uglygep1 = bitcast i8* %uglygep to i32*
  %0 = load i32, i32* %uglygep1, align 4, !tbaa !2
  %cmp1 = icmp ne i32 %0, -559038737
  %cmp11 = icmp eq i32 %lsr.iv, 0
  %cmp = or i1 %cmp11, %cmp1
  %lsr.iv.next = add nuw i32 %lsr.iv, 8
  br i1 %cmp, label %while.body, label %while.end

; Exit blocks
while.end:                                        ; preds = %while.body
  store volatile i32 %0, i32* %result, align 4, !tbaa !2
  ret void

I guess "%uglygep = getelementptr.." will be lowered to @buffer +
(%lsr.iv * StoreSize(i32)). That's what I see in the final code. But then
%lsr.iv.next should be incremented by 1; BUT it is incremented by 8.

Incrementing %lsr.iv.next by 8 would make sense if getelementptr were lowered to
@buffer + %lsr.iv.
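
The two lowerings contrasted above can be compared numerically. The sketch below models a `getelementptr` with a single index as `base + index * element_size`. It assumes an element size of 1 for `i8`, which is the usual case, and of 8 for `i32` on this target, as the `{@buffer,+,8}` recurrence in the debug output suggests; both sizes ultimately come from the target's data layout, which is not shown in this thread. The `gep` helper and the base address are hypothetical.

```python
def gep(base: int, index: int, element_size: int) -> int:
    """Address computed by a single-index getelementptr (simplified model)."""
    return base + index * element_size


BUFFER = 0x1000  # placeholder address for @buffer
I32_SIZE = 8     # assumption: size of i32 on this target, per {@buffer,+,8}
I8_SIZE = 1      # assumption: i8 occupies one address unit

iterations = range(4)

# Original loop: index %i.010 steps by 1, GEP over i32 elements.
original = [gep(BUFFER, i, I32_SIZE) for i in iterations]

# After LSR: %lsr.iv steps by 8, GEP over i8 elements.
after_lsr = [gep(BUFFER, 8 * i, I8_SIZE) for i in iterations]

# What the final code reportedly does: %lsr.iv, which already steps by 8,
# is scaled by the i32 size once more.
double_scaled = [gep(BUFFER, 8 * i, I32_SIZE) for i in iterations]

print([hex(a) for a in original])
print([hex(a) for a in after_lsr])
print([hex(a) for a in double_scaled])
```

In this model the IR after LSR addresses the same locations as the original loop as long as the `i8` GEP is scaled by 1; a stride of 64 only appears if the index is multiplied by 8 a second time during lowering.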

Thanks for your help,
Boris



