thr3ads.net - llvm dev - [llvm-dev] LoopStrengthReduction generates false code [Jun 2020]

If this information is useful, please help other people find it:
Share via:

Boris Boesler via llvm-dev

2020-Jun-09 08:17 UTC

[llvm-dev] LoopStrengthReduction generates false code

Hi.

In my backend I get false code after using StrengthLoopReduction. In the
generated code the loop index variable is multiplied by 8 (correct, everything
is 64 bit aligned) to get an address offset, and the index variable is
incremented by 1*8, which is not correct. It should be incremented by 1 only.
The factor 8 appears again.

I compared the debug output (-debug-only=scalar-evolution,loop-reduce) for my
backend and the ARM backend, but simply can't read/understand it. They
differ in factors 4 vs 8 (ok), but there are more differences, probably caused
by the implementation of TargetTransformInfo for ARM, while I haven't
implemented it for my arch, yet.

How can I debug this further? In my arch everything is 64 bit aligned (factor 8
in many passes) and the memory is not byte addressed.

Thanks,
Boris

----8<----

LLVM assembly:

@buffer = common dso_local global [10 x i32] zeroinitializer, align 4

; Function Attrs: nounwind
define dso_local void @some_main(i32* %result) local_unnamed_addr #0 {
entry:
  tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x
i32]* @buffer, i32 0, i32 0)) #2
  br label %while.body

while.body:                                       ; preds = %entry, %while.body
  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
  %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0, i32
%i.010
  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
  %cmp1 = icmp ne i32 %0, -559038737
  %inc = add nuw nsw i32 %i.010, 1
  %cmp11 = icmp eq i32 %i.010, 0
  %cmp = or i1 %cmp11, %cmp1
  br i1 %cmp, label %while.body, label %while.end

while.end:                                        ; preds = %while.body
  %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32 0,
i32 %i.010
  %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
  store volatile i32 %1, i32* %result, align 4, !tbaa !2
  ret void
}

declare dso_local void @fill_array(i32*) local_unnamed_addr #1

attributes #0 = { nounwind
"correctly-rounded-divide-sqrt-fp-math"="false"
"disable-tail-calls"="false"
"less-precise-fpmad"="false"
"no-frame-pointer-elim"="true"
"no-frame-pointer-elim-non-leaf"
"no-infs-fp-math"="false"
"no-jump-tables"="false"
"no-nans-fp-math"="false"
"no-signed-zeros-fp-math"="false"
"no-trapping-math"="false"
"stack-protector-buffer-size"="8"
"unsafe-fp-math"="false"
"use-soft-float"="false" }
attributes #1 = {
"correctly-rounded-divide-sqrt-fp-math"="false"
"disable-tail-calls"="false"
"less-precise-fpmad"="false"
"no-frame-pointer-elim"="true"
"no-frame-pointer-elim-non-leaf"
"no-infs-fp-math"="false"
"no-nans-fp-math"="false"
"no-signed-zeros-fp-math"="false"
"no-trapping-math"="false"
"stack-protector-buffer-size"="8"
"unsafe-fp-math"="false"
"use-soft-float"="false" }
attributes #2 = { nounwind }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
!2 = !{!3, !3, i64 0}
!3 = !{!"int", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C/C++ TBAA"}


(-debug-only=scalar-evolution,loop-reduce) for my arch:

LSR on loop %while.body:
Collecting IV Chains.
IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
IV={@buffer,+,8}<nsw><%while.body>
IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
IV={0,+,1}<nuw><nsw><%while.body>
IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
LSR has identified the following interesting factors and types: *8
LSR is examining the following fixup sites:
  UserInst=%cmp11, OperandValToReplace=%i.010
  UserInst=%0, OperandValToReplace=%arrayidx
LSR found 2 uses:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
i32*
    reg({@buffer,+,8}<nsw><%while.body>)

After generating reuse formulae:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
    reg({0,+,8}<nuw><nsw><%while.body>)
    reg({0,+,1}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
i32*
    reg({@buffer,+,8}<nsw><%while.body>)
    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
  Filtering out formula reg({0,+,1}<nuw><nsw><%while.body>)
    in favor of formula reg({0,+,-1}<nw><%while.body>)
Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0},
widest fixup type: i32*

After filtering out undesirable candidates:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
    reg({0,+,8}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
i32*
    reg({@buffer,+,8}<nsw><%while.body>)
    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
New best at 2 instructions 2 regs, with addrec cost 2.
 Regs: {0,+,-1}<nw><%while.body>
{@buffer,+,8}<nsw><%while.body>
New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
 Regs: {0,+,8}<nuw><nsw><%while.body> @buffer

The chosen solution requires 2 instructions 2 regs, with addrec cost 1, plus 1
base add:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,8}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
i32*
    reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)


(-debug-only=scalar-evolution,loop-reduce) for ARM:

LSR on loop %while.body:
Collecting IV Chains.
IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
IV={@buffer,+,4}<nsw><%while.body>
IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
IV={0,+,1}<nuw><nsw><%while.body>
IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]) IV+1
Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
LSR has identified the following interesting factors and types: *4
LSR is examining the following fixup sites:
  UserInst=%cmp11, OperandValToReplace=%i.010
  UserInst=%0, OperandValToReplace=%arrayidx
LSR found 2 uses:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
i32*
    reg({@buffer,+,4}<nsw><%while.body>)

After generating reuse formulae:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
    reg({0,+,4}<nuw><nsw><%while.body>)
    reg({0,+,1}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
i32*
    reg({@buffer,+,4}<nsw><%while.body>)
    reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
    -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
    reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
    reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
Filtering for use LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0},
widest fixup type: i32*
  Filtering out formula -1*reg({(-1 *
@buffer),+,-4}<nw><%while.body>)
    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
  Filtering out formula reg(@buffer) +
-1*reg({0,+,-4}<nw><%while.body>)
    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)

After filtering out undesirable candidates:
LSR is examining the following uses:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
    reg({0,+,4}<nuw><nsw><%while.body>)
    reg({0,+,1}<nuw><nsw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
i32*
    reg({@buffer,+,4}<nsw><%while.body>)
    reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
    reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
New best at 1 instruction 2 regs, with addrec cost 1.
 Regs: {0,+,-1}<nw><%while.body> @buffer

The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
    reg({0,+,-1}<nw><%while.body>)
  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type:
i32*
    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)

Eli Friedman via llvm-dev

2020-Jun-09 17:36 UTC

head link

[llvm-dev] LoopStrengthReduction generates false code

Blindly guessing here, "memory is not byte addressed", but you never
fixed ScalarEvolution to handle that, so it's modeling the GEP in a way
you're not expecting.

-Eli
> -----Original Message-----
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of
Boris Boesler
> via llvm-dev
> Sent: Tuesday, June 9, 2020 1:17 AM
> To: llvm-dev at lists.llvm.org
> Subject: [EXT] [llvm-dev] LoopStrengthReduction generates false code
>
> Hi.
>
> In my backend I get false code after using StrengthLoopReduction. In the
> generated code the loop index variable is multiplied by 8 (correct,
everything
> is 64 bit aligned) to get an address offset, and the index variable is
> incremented by 1*8, which is not correct. It should be incremented by 1
> only. The factor 8 appears again.
>
> I compared the debug output (-debug-only=scalar-evolution,loop-reduce) for
> my backend and the ARM backend, but simply can't read/understand it.
> They differ in factors 4 vs 8 (ok), but there are more differences,
probably
> caused by the implementation of TargetTransformInfo for ARM, while I
> haven't implemented it for my arch, yet.
>
> How can I debug this further? In my arch everything is 64 bit aligned
(factor 8
> in many passes) and the memory is not byte addressed.
>
> Thanks,
> Boris
>
> ----8<----
>
> LLVM assembly:
>
> @buffer = common dso_local global [10 x i32] zeroinitializer, align 4
>
> ; Function Attrs: nounwind
> define dso_local void @some_main(i32* %result) local_unnamed_addr #0 {
> entry:
>   tail call void @fill_array(i32* getelementptr inbounds ([10 x i32], [10 x
i32]*
> @buffer, i32 0, i32 0)) #2
>   br label %while.body
>
> while.body:                                       ; preds = %entry,
%while.body
>   %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
>   %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32
0,
> i32 %i.010
>   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
>   %cmp1 = icmp ne i32 %0, -559038737
>   %inc = add nuw nsw i32 %i.010, 1
>   %cmp11 = icmp eq i32 %i.010, 0
>   %cmp = or i1 %cmp11, %cmp1
>   br i1 %cmp, label %while.body, label %while.end
>
> while.end:                                        ; preds = %while.body
>   %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer, i32
0,
> i32 %i.010
>   %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
>   store volatile i32 %1, i32* %result, align 4, !tbaa !2
>   ret void
> }
>
> declare dso_local void @fill_array(i32*) local_unnamed_addr #1
>
> attributes #0 = { nounwind
"correctly-rounded-divide-sqrt-fp-math"="false"
> "disable-tail-calls"="false"
"less-precise-fpmad"="false" "no-frame-pointer-
> elim"="true" "no-frame-pointer-elim-non-leaf"
"no-infs-fp-math"="false"
> "no-jump-tables"="false"
"no-nans-fp-math"="false" "no-signed-zeros-fp-
> math"="false" "no-trapping-math"="false"
"stack-protector-buffer-size"="8"
> "unsafe-fp-math"="false"
"use-soft-float"="false" }
> attributes #1 = {
"correctly-rounded-divide-sqrt-fp-math"="false"
"disable-
> tail-calls"="false"
"less-precise-fpmad"="false" "no-frame-pointer-
> elim"="true" "no-frame-pointer-elim-non-leaf"
"no-infs-fp-math"="false"
> "no-nans-fp-math"="false"
"no-signed-zeros-fp-math"="false" "no-trapping-
> math"="false"
"stack-protector-buffer-size"="8"
"unsafe-fp-math"="false"
> "use-soft-float"="false" }
> attributes #2 = { nounwind }
>
> !llvm.module.flags = !{!0}
> !llvm.ident = !{!1}
>
> !0 = !{i32 1, !"wchar_size", i32 4}
> !1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
> !2 = !{!3, !3, i64 0}
> !3 = !{!"int", !4, i64 0}
> !4 = !{!"omnipotent char", !5, i64 0}
> !5 = !{!"Simple C/C++ TBAA"}
>
>
> (-debug-only=scalar-evolution,loop-reduce) for my arch:
>
> LSR on loop %while.body:
> Collecting IV Chains.
> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
> IV={@buffer,+,8}<nsw><%while.body>
> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
> IV={0,+,1}<nuw><nsw><%while.body>
> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ])
IV+1
> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
> LSR has identified the following interesting factors and types: *8
> LSR is examining the following fixup sites:
>   UserInst=%cmp11, OperandValToReplace=%i.010
>   UserInst=%0, OperandValToReplace=%arrayidx
> LSR found 2 uses:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
type:
> i32*
>     reg({@buffer,+,8}<nsw><%while.body>)
>
> After generating reuse formulae:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>     reg({0,+,8}<nuw><nsw><%while.body>)
>     reg({0,+,1}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
type:
> i32*
>     reg({@buffer,+,8}<nsw><%while.body>)
>     reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type:
i32
>   Filtering out formula
reg({0,+,1}<nuw><nsw><%while.body>)
>     in favor of formula reg({0,+,-1}<nw><%while.body>)
> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0),
Offsets={0},
> widest fixup type: i32*
>
> After filtering out undesirable candidates:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>     reg({0,+,8}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
type:
> i32*
>     reg({@buffer,+,8}<nsw><%while.body>)
>     reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
> New best at 2 instructions 2 regs, with addrec cost 2.
>  Regs: {0,+,-1}<nw><%while.body>
{@buffer,+,8}<nsw><%while.body>
> New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
>  Regs: {0,+,8}<nuw><nsw><%while.body> @buffer
>
> The chosen solution requires 2 instructions 2 regs, with addrec cost 1,
plus 1
> base add:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,8}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
type:
> i32*
>     reg(@buffer) + 1*reg({0,+,8}<nuw><nsw><%while.body>)
>
>
> (-debug-only=scalar-evolution,loop-reduce) for ARM:
>
> LSR on loop %while.body:
> Collecting IV Chains.
> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
> IV={@buffer,+,4}<nsw><%while.body>
> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
> IV={0,+,1}<nuw><nsw><%while.body>
> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ])
IV+1
> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
> LSR has identified the following interesting factors and types: *4
> LSR is examining the following fixup sites:
>   UserInst=%cmp11, OperandValToReplace=%i.010
>   UserInst=%0, OperandValToReplace=%arrayidx
> LSR found 2 uses:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
type:
> i32*
>     reg({@buffer,+,4}<nsw><%while.body>)
>
> After generating reuse formulae:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>     reg({0,+,4}<nuw><nsw><%while.body>)
>     reg({0,+,1}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
type:
> i32*
>     reg({@buffer,+,4}<nsw><%while.body>)
>     reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>     -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>     reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
>     reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>     reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type:
i32
> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0),
Offsets={0},
> widest fixup type: i32*
>   Filtering out formula -1*reg({(-1 *
@buffer),+,-4}<nw><%while.body>)
>     in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>   Filtering out formula reg(@buffer) +
-1*reg({0,+,-4}<nw><%while.body>)
>     in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>
> After filtering out undesirable candidates:
> LSR is examining the following uses:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>     reg({0,+,4}<nuw><nsw><%while.body>)
>     reg({0,+,1}<nuw><nsw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
type:
> i32*
>     reg({@buffer,+,4}<nsw><%while.body>)
>     reg(@buffer) + 1*reg({0,+,4}<nuw><nsw><%while.body>)
>     reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>     reg(@buffer) + 4*reg({0,+,1}<nuw><nsw><%while.body>)
> New best at 1 instruction 2 regs, with addrec cost 1.
>  Regs: {0,+,-1}<nw><%while.body> @buffer
>
> The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
>   LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>     reg({0,+,-1}<nw><%while.body>)
>   LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup
type:
> i32*
>     reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Boris Boesler via llvm-dev

2020-Jun-09 18:59 UTC

head link

[llvm-dev] LoopStrengthReduction generates false code

Hm, no. I expect byte addresses - everywhere. The compiler should not know that
the arch needs word addresses. During lowering LOAD and STORE get explicit
conversion operations for the memory address. Even if my arch was byte addressed
the code would be false/illegal.

Boris
> Am 09.06.2020 um 19:36 schrieb Eli Friedman <efriedma at
quicinc.com>:
> 
> Blindly guessing here, "memory is not byte addressed", but you
never fixed ScalarEvolution to handle that, so it's modeling the GEP in a
way you're not expecting.
> 
> -Eli
> 
>> -----Original Message-----
>> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of
Boris Boesler
>> via llvm-dev
>> Sent: Tuesday, June 9, 2020 1:17 AM
>> To: llvm-dev at lists.llvm.org
>> Subject: [EXT] [llvm-dev] LoopStrengthReduction generates false code
>> 
>> Hi.
>> 
>> In my backend I get false code after using StrengthLoopReduction. In
the
>> generated code the loop index variable is multiplied by 8 (correct,
everything
>> is 64 bit aligned) to get an address offset, and the index variable is
>> incremented by 1*8, which is not correct. It should be incremented by 1
>> only. The factor 8 appears again.
>> 
>> I compared the debug output (-debug-only=scalar-evolution,loop-reduce)
for
>> my backend and the ARM backend, but simply can't read/understand
it.
>> They differ in factors 4 vs 8 (ok), but there are more differences,
probably
>> caused by the implementation of TargetTransformInfo for ARM, while I
>> haven't implemented it for my arch, yet.
>> 
>> How can I debug this further? In my arch everything is 64 bit aligned
(factor 8
>> in many passes) and the memory is not byte addressed.
>> 
>> Thanks,
>> Boris
>> 
>> ----8<----
>> 
>> LLVM assembly:
>> 
>> @buffer = common dso_local global [10 x i32] zeroinitializer, align 4
>> 
>> ; Function Attrs: nounwind
>> define dso_local void @some_main(i32* %result) local_unnamed_addr #0 {
>> entry:
>>  tail call void @fill_array(i32* getelementptr inbounds ([10 x i32],
[10 x i32]*
>> @buffer, i32 0, i32 0)) #2
>>  br label %while.body
>> 
>> while.body:                                       ; preds = %entry,
%while.body
>>  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
>>  %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @buffer,
i32 0,
>> i32 %i.010
>>  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
>>  %cmp1 = icmp ne i32 %0, -559038737
>>  %inc = add nuw nsw i32 %i.010, 1
>>  %cmp11 = icmp eq i32 %i.010, 0
>>  %cmp = or i1 %cmp11, %cmp1
>>  br i1 %cmp, label %while.body, label %while.end
>> 
>> while.end:                                        ; preds = %while.body
>>  %arrayidx2 = getelementptr inbounds [10 x i32], [10 x i32]* @buffer,
i32 0,
>> i32 %i.010
>>  %1 = load i32, i32* %arrayidx2, align 4, !tbaa !2
>>  store volatile i32 %1, i32* %result, align 4, !tbaa !2
>>  ret void
>> }
>> 
>> declare dso_local void @fill_array(i32*) local_unnamed_addr #1
>> 
>> attributes #0 = { nounwind
"correctly-rounded-divide-sqrt-fp-math"="false"
>> "disable-tail-calls"="false"
"less-precise-fpmad"="false" "no-frame-pointer-
>> elim"="true" "no-frame-pointer-elim-non-leaf"
"no-infs-fp-math"="false"
>> "no-jump-tables"="false"
"no-nans-fp-math"="false" "no-signed-zeros-fp-
>> math"="false"
"no-trapping-math"="false"
"stack-protector-buffer-size"="8"
>> "unsafe-fp-math"="false"
"use-soft-float"="false" }
>> attributes #1 = {
"correctly-rounded-divide-sqrt-fp-math"="false"
"disable-
>> tail-calls"="false"
"less-precise-fpmad"="false" "no-frame-pointer-
>> elim"="true" "no-frame-pointer-elim-non-leaf"
"no-infs-fp-math"="false"
>> "no-nans-fp-math"="false"
"no-signed-zeros-fp-math"="false" "no-trapping-
>> math"="false"
"stack-protector-buffer-size"="8"
"unsafe-fp-math"="false"
>> "use-soft-float"="false" }
>> attributes #2 = { nounwind }
>> 
>> !llvm.module.flags = !{!0}
>> !llvm.ident = !{!1}
>> 
>> !0 = !{i32 1, !"wchar_size", i32 4}
>> !1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}
>> !2 = !{!3, !3, i64 0}
>> !3 = !{!"int", !4, i64 0}
>> !4 = !{!"omnipotent char", !5, i64 0}
>> !5 = !{!"Simple C/C++ TBAA"}
>> 
>> 
>> (-debug-only=scalar-evolution,loop-reduce) for my arch:
>> 
>> LSR on loop %while.body:
>> Collecting IV Chains.
>> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
>> IV={@buffer,+,8}<nsw><%while.body>
>> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
>> IV={0,+,1}<nuw><nsw><%while.body>
>> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body
]) IV+1
>> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>> LSR has identified the following interesting factors and types: *8
>> LSR is examining the following fixup sites:
>>  UserInst=%cmp11, OperandValToReplace=%i.010
>>  UserInst=%0, OperandValToReplace=%arrayidx
>> LSR found 2 uses:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest
fixup type:
>> i32*
>>    reg({@buffer,+,8}<nsw><%while.body>)
>> 
>> After generating reuse formulae:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,8}<nuw><nsw><%while.body>)
>>    reg({0,+,1}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest
fixup type:
>> i32*
>>    reg({@buffer,+,8}<nsw><%while.body>)
>>    reg(@buffer) +
1*reg({0,+,8}<nuw><nsw><%while.body>)
>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup
type: i32
>>  Filtering out formula
reg({0,+,1}<nuw><nsw><%while.body>)
>>    in favor of formula reg({0,+,-1}<nw><%while.body>)
>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0),
Offsets={0},
>> widest fixup type: i32*
>> 
>> After filtering out undesirable candidates:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,8}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest
fixup type:
>> i32*
>>    reg({@buffer,+,8}<nsw><%while.body>)
>>    reg(@buffer) +
1*reg({0,+,8}<nuw><nsw><%while.body>)
>> New best at 2 instructions 2 regs, with addrec cost 2.
>> Regs: {0,+,-1}<nw><%while.body>
{@buffer,+,8}<nsw><%while.body>
>> New best at 2 instructions 2 regs, with addrec cost 1, plus 1 base add.
>> Regs: {0,+,8}<nuw><nsw><%while.body> @buffer
>> 
>> The chosen solution requires 2 instructions 2 regs, with addrec cost 1,
plus 1
>> base add:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,8}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest
fixup type:
>> i32*
>>    reg(@buffer) +
1*reg({0,+,8}<nuw><nsw><%while.body>)
>> 
>> 
>> (-debug-only=scalar-evolution,loop-reduce) for ARM:
>> 
>> LSR on loop %while.body:
>> Collecting IV Chains.
>> IV Chain#0 Head: (  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2)
>> IV={@buffer,+,4}<nsw><%while.body>
>> IV Chain#1 Head: (  %cmp11 = icmp eq i32 %i.010, 0)
>> IV={0,+,1}<nuw><nsw><%while.body>
>> IV Chain#1  Inc: (  %i.010 = phi i32 [ 0, %entry ], [ %inc, %while.body
]) IV+1
>> Chain:   %cmp11 = icmp eq i32 %i.010, 0 Cost: 0
>> LSR has identified the following interesting factors and types: *4
>> LSR is examining the following fixup sites:
>>  UserInst=%cmp11, OperandValToReplace=%i.010
>>  UserInst=%0, OperandValToReplace=%arrayidx
>> LSR found 2 uses:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest
fixup type:
>> i32*
>>    reg({@buffer,+,4}<nsw><%while.body>)
>> 
>> After generating reuse formulae:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,4}<nuw><nsw><%while.body>)
>>    reg({0,+,1}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest
fixup type:
>> i32*
>>    reg({@buffer,+,4}<nsw><%while.body>)
>>    reg(@buffer) +
1*reg({0,+,4}<nuw><nsw><%while.body>)
>>    -1*reg({(-1 * @buffer),+,-4}<nw><%while.body>)
>>    reg(@buffer) +
4*reg({0,+,1}<nuw><nsw><%while.body>)
>>    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>    reg(@buffer) + -1*reg({0,+,-4}<nw><%while.body>)
>> Filtering for use LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup
type: i32
>> Filtering for use LSR Use: Kind=Address of i32 in addrspace(0),
Offsets={0},
>> widest fixup type: i32*
>>  Filtering out formula -1*reg({(-1 *
@buffer),+,-4}<nw><%while.body>)
>>    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>>  Filtering out formula reg(@buffer) +
-1*reg({0,+,-4}<nw><%while.body>)
>>    in favor of formula reg({@buffer,+,4}<nsw><%while.body>)
>> 
>> After filtering out undesirable candidates:
>> LSR is examining the following uses:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>    reg({0,+,4}<nuw><nsw><%while.body>)
>>    reg({0,+,1}<nuw><nsw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest
fixup type:
>> i32*
>>    reg({@buffer,+,4}<nsw><%while.body>)
>>    reg(@buffer) +
1*reg({0,+,4}<nuw><nsw><%while.body>)
>>    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>>    reg(@buffer) +
4*reg({0,+,1}<nuw><nsw><%while.body>)
>> New best at 1 instruction 2 regs, with addrec cost 1.
>> Regs: {0,+,-1}<nw><%while.body> @buffer
>> 
>> The chosen solution requires 1 instruction 2 regs, with addrec cost 1:
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg({0,+,-1}<nw><%while.body>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest
fixup type:
>> i32*
>>    reg(@buffer) + -4*reg({0,+,-1}<nw><%while.body>)
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - Jun 2020 - LoopStrengthReduction generates false code

[llvm-dev] LoopStrengthReduction generates false code

[llvm-dev] LoopStrengthReduction generates false code

[llvm-dev] LoopStrengthReduction generates false code

Apparently Analagous Threads