Boris Boesler via llvm-dev
2020-Jun-09 18:59 UTC
[llvm-dev] LoopStrengthReduction generates false code
Hm, no. I expect byte addresses everywhere. The compiler should not know that the arch needs word addresses. During lowering, LOAD and STORE get explicit conversion operations for the memory address. Even if my arch were byte addressed, the code would be wrong/illegal.

Boris

> On 09.06.2020 at 19:36, Eli Friedman <efriedma at quicinc.com> wrote:
>
> Blindly guessing here, "memory is not byte addressed", but you never fixed ScalarEvolution to handle that, so it's modeling the GEP in a way you're not expecting.
>
> -Eli
>
>> -----Original Message-----
>> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Boris Boesler via llvm-dev
>> Sent: Tuesday, June 9, 2020 1:17 AM
>> To: llvm-dev at lists.llvm.org
>> Subject: [EXT] [llvm-dev] LoopStrengthReduction generates false code
>>
>> Hi.
>>
>> In my backend I get false code after using LoopStrengthReduction. In the
>> generated code the loop index variable is multiplied by 8 (correct, everything
>> is 64 bit aligned) to get an address offset, and the index variable is
>> incremented by 1*8, which is not correct. It should be incremented by 1
>> only. The factor 8 appears again.
>>
>> I compared the debug output (-debug-only=scalar-evolution,loop-reduce) for
>> my backend and the ARM backend, but simply can't read/understand it.
>> They differ in factors 4 vs 8 (ok), but there are more differences, probably
>> caused by the implementation of TargetTransformInfo for ARM, while I
>> haven't implemented it for my arch yet.
>>
>> How can I debug this further? In my arch everything is 64 bit aligned (factor 8
>> in many passes) and the memory is not byte addressed.
>>
>> Thanks,
>> Boris
>>
>> ----8<----
>>
>> LLVM assembly:
>>
>> @buffer = common dso_local global [10 x i32] zeroinitializer, align 4
>>
>> ; Function Attrs: nounwind
>> define dso_local void @some_main(i32* %result) local_unnamed_addr #0 {
>> entry:
>>   tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]* @buffer, i32 0, i32 0)) #2
>>   br label %while.body
>>
>> while.body:                                       ; preds = %entry, %while.body
>>   %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
>>   %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
>>   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
>>   %cmp1 = icmp ne i32 %0, -559038737
>>   %inc = add nuw nsw i32 %i.010, 1
>>   %cmp11 = icmp eq i32 %i.010, 0
>>   %cmp = or i1 %cmp11, %cmp1
>>   br i1 %cmp, label %while.body, label %while.end
>>
>> while.end:                                        ; preds = %while.body
>>   %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32 %i.010
>>   %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
>>   store volatile i32 %1, i32* %result, align 4, !tbaa !2
>>   ret void
>> }
>>
>> declare dso_local void @fill_array(i32*) local_unnamed_addr #1
>>
>> attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
>> attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
>> attributes #2 = { nounwind }
>>
>> !llvm.module.flags = !{!0}
>> !llvm.ident = !{!1}
>>
>> !0 = !{i32 1, !"wchar_size", i32 4}
>> !1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
>> !2 = !{!3, !3, i64 0}
>> !3 = !{!"int", !4, i64 0}
>> !4 = !{!"omnipotent char", !5, i64 0}
>> !5 = !{!"Simple C/C++ TBAA"}
>>
>>
>> (-debug-only=scalar-evolution,loop-reduce) for my arch:
>>
>> LSR on loop %while.body:
>> Collecting IV Chains.
>> IV Chain#0 Head: ( %0 = load i32, i32* %arrayidx, align 4, !tbaa !2) IV={@buffer,+,8}<nsw><%while.body>
>> IV Chain#1 Head: ( %cmp11 = icmp eq i32 %i.010, 0) IV={0,+,1}<nuw><nsw><%while.body>
>> IV Chain#1 Inc: ( %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
>> Chain: %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>> LSR has identified the following interesting factors and types: *8
>> LSR is examining the following fixup sites:
>>   UserInst=%cmp11, OperandValToReplace=%i.010
>>   UserInst=%0, OperandValToReplace=%arrayidx
>> LSR found 2 uses:
>> LSR is examining the following uses:
>>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>     reg({0,+,-1}<nw><%while.body>)
>>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>     reg({@buffer,+,8}<nsw><%while.body>)
>>
>> After generating reuse formulae:
>> LSR is examining the following uses:
>>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>     reg({0,+,-1}<nw><%while.body>)
>>     reg({0,+,8}<nuw><nsw><%while.body>)
>>     reg({0,+,1}<nuw><nsw><%while.body>)
>>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>     reg({@buffer,+,8}<nsw><%while.body>)
>>     reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>   Filtering out formula reg({0,+,1}<nuw><nsw><%while.body>)
>>     in favor of formula reg({0,+,-1}<nw><%while.body>)
>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>
>> After filtering out undesirable candidates:
>> LSR is examining the following uses:
>>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>     reg({0,+,-1}<nw><%while.body>)
>>     reg({0,+,8}<nuw><nsw><%while.body>)
>>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>     reg({@buffer,+,8}<nsw><%while.body>)
>>     reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>> New best at 2 instructions 2 regs, with addrec cost 2.
>>   Regs: {0,+,-1}<nw><%while.body> {@buffer,+,8}<nsw><%while.body>
>> New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
>>   Regs: {0,+,8}<nuw><nsw><%while.body> @buffer
>>
>> The chosen solution requires 2 instructions 2 regs, with addrec cost 1, plus 1 base add:
>>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>     reg({0,+,8}<nuw><nsw><%while.body>)
>>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>     reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>>
>>
>> (-debug-only=scalar-evolution,loop-reduce) for ARM:
>>
>> LSR on loop %while.body:
>> Collecting IV Chains.
>> IV Chain#0 Head: ( %0 = load i32, i32* %arrayidx, align 4, !tbaa !2) IV={@buffer,+,4}<nsw><%while.body>
>> IV Chain#1 Head: ( %cmp11 = icmp eq i32 %i.010, 0) IV={0,+,1}<nuw><nsw><%while.body>
>> IV Chain#1 Inc: ( %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
>> Chain: %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>> LSR has identified the following interesting factors and types: *4
>> LSR is examining the following fixup sites:
>>   UserInst=%cmp11, OperandValToReplace=%i.010
>>   UserInst=%0, OperandValToReplace=%arrayidx
>> LSR found 2 uses:
>> LSR is examining the following uses:
>>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>     reg({0,+,-1}<nw><%while.body>)
>>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>     reg({@buffer,+,4}<nsw><%while.body>)
>>
>> After generating reuse formulae:
>> LSR is examining the following uses:
>>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>     reg({0,+,-1}<nw><%while.body>)
>>     reg({0,+,4}<nuw><nsw><%while.body>)
>>     reg({0,+,1}<nuw><nsw><%while.body>)
>>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>     reg({@buffer,+,4}<nsw><%while.body>)
>>     reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>>     -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>>     reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>>     reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>     reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>   Filtering out formula -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>>     in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>>   Filtering out formula reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>>     in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>>
>> After filtering out undesirable candidates:
>> LSR is examining the following uses:
>>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>     reg({0,+,-1}<nw><%while.body>)
>>     reg({0,+,4}<nuw><nsw><%while.body>)
>>     reg({0,+,1}<nuw><nsw><%while.body>)
>>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>     reg({@buffer,+,4}<nsw><%while.body>)
>>     reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>>     reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>     reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>> New best at 1 instruction 2 regs, with addrec cost 1.
>>   Regs: {0,+,-1}<nw><%while.body> @buffer
>>
>> The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
>>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>     reg({0,+,-1}<nw><%while.body>)
>>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>     reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
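
A note on reading the debug output quoted in this message: ScalarEvolution prints an add recurrence as {start,+,step}<flags><loop>, which stands for the value start + step*n on iteration n of that loop. The following is a minimal Python sketch, not LLVM code, that evaluates the formulae LSR chose for both targets and checks that each one reproduces the original address recurrence from "IV Chain#0". The helper `addrec` and the numeric value of `BUFFER` are made up for illustration.

```python
# Hypothetical model of SCEV add recurrences; this is not an LLVM API.
BUFFER = 0x1000  # assumed stand-in for the address of @buffer


def addrec(start, step):
    """Return a function n -> value of {start,+,step} on iteration n."""
    return lambda n: start + step * n


# Original address recurrences reported under "IV Chain#0".
addr_my_arch = addrec(BUFFER, 8)  # {@buffer,+,8}
addr_arm = addrec(BUFFER, 4)      # {@buffer,+,4}

# Induction variables kept by the chosen solutions.
iv_my_arch = addrec(0, 8)         # reg({0,+,8})
iv_arm = addrec(0, -1)            # reg({0,+,-1})


def chosen_my_arch(n):
    # reg(@buffer) + 1*reg({0,+,8})
    return BUFFER + 1 * iv_my_arch(n)


def chosen_arm(n):
    # reg(@buffer) + -4*reg({0,+,-1})
    return BUFFER + -4 * iv_arm(n)


def formulas_match(iterations=10):
    """Check that both chosen formulae equal the original recurrences."""
    return all(
        chosen_my_arch(n) == addr_my_arch(n) and chosen_arm(n) == addr_arm(n)
        for n in range(iterations)
    )


if __name__ == "__main__":
    for n in range(3):
        print(n, hex(chosen_my_arch(n)), hex(chosen_arm(n)))
```

Under this reading, both solutions address the same bytes as the original loop, provided the resulting offset is added to @buffer without further scaling. The ICmpZero use still works with {0,+,8} because that value is zero exactly when {0,+,1} is zero.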
Eli Friedman via llvm-dev
2020-Jun-09 19:56 UTC
[llvm-dev] LoopStrengthReduction generates false code
Hmm. Then I'm not sure; at first glance, the debug output looks fine either way. Could you show the IR after LSR, and explain why it's wrong?

-Eli
Boris Boesler via llvm-dev
2020-Jun-10 08:58 UTC
[llvm-dev] LoopStrengthReduction generates false code
The IR after LSR is:

*** IR Dump After Loop Strength Reduction ***
; Preheader:
entry:
  tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x i32]* @buffer, i32 0, i32 0)) #2
  br label %while.body

; Loop:
while.body:                                       ; preds = %while.body, %entry
  %lsr.iv = phi i32 [ %lsr.iv.next, %while.body ], [ 0, %entry ]
  %uglygep = getelementptr i8, i8* bitcast ([10 x i32]* @buffer to i8*), i32 %lsr.iv
  %uglygep1 = bitcast i8* %uglygep to i32*
  %0 = load i32, i32* %uglygep1, align 4, !tbaa !2
  %cmp1 = icmp ne i32 %0, -559038737
  %cmp11 = icmp eq i32 %lsr.iv, 0
  %cmp = or i1 %cmp11, %cmp1
  %lsr.iv.next = add nuw i32 %lsr.iv, 8
  br i1 %cmp, label %while.body, label %while.end

; Exit blocks
while.end:                                        ; preds = %while.body
  store volatile i32 %0, i32* %result, align 4, !tbaa !2
  ret void

I guess "%uglygep = getelementptr.." will be lowered to @buffer + (%lsr.iv * StoreSize(i32)). That's what I see in the final code. But then %lsr.iv.next should be incremented by 1; BUT it is incremented by 8.

Incrementing %lsr.iv.next by 8 would make sense if getelementptr were lowered to @buffer + %lsr.iv.

Thanks for your help,
Boris
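
The question in this message comes down to how the index of the i8 getelementptr is scaled. In LLVM IR, a GEP scales its index by the allocation size of the GEP's own element type, which for %uglygep is i8 rather than i32. The following Python sketch compares the two readings. It is only an illustration under stated assumptions: `BUFFER` is a made-up address, `I32_ALLOC_SIZE = 8` is inferred from the stride in {@buffer,+,8}, and `I8_ALLOC_SIZE = 1` assumes that i8 occupies one addressable unit in the target's data layout, which the thread does not state.

```python
BUFFER = 0x1000     # assumed stand-in for the address of @buffer
I32_ALLOC_SIZE = 8  # assumption: inferred from the stride in {@buffer,+,8}
I8_ALLOC_SIZE = 1   # assumption: i8 occupies one addressable unit


def original_address(i):
    """Address of %arrayidx: element i of [10 x i32] @buffer."""
    return BUFFER + i * I32_ALLOC_SIZE


def lsr_iv(n, step=8):
    """Value of %lsr.iv on iteration n (%lsr.iv.next = %lsr.iv + 8)."""
    return n * step


def uglygep_address(n, index_scale):
    """Address of %uglygep when the index is scaled by index_scale."""
    return BUFFER + lsr_iv(n) * index_scale


def compare(iterations=4):
    """Return rows of (n, original, i8-scaled, i32-scaled) addresses."""
    return [
        (
            n,
            original_address(n),
            uglygep_address(n, I8_ALLOC_SIZE),
            uglygep_address(n, I32_ALLOC_SIZE),
        )
        for n in range(iterations)
    ]


if __name__ == "__main__":
    for row in compare():
        print(row[0], [hex(a) for a in row[1:]])
```

With the i8 scale, the rewritten loop visits the same addresses as the original. If a backend instead applied the i32 scale to this GEP, the second iteration would land at @buffer + 64 rather than @buffer + 8, which would look like the factor 8 being applied twice. Whether that is what happens in this backend is not confirmed by the thread.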