William J. Schmidt
2012-Sep-20  23:02 UTC
[LLVMdev] Scheduling question (memory dependency)
Greetings,
I'm investigating a bug in the PowerPC back end in which a load from a
storage address is being reordered prior to a store to the same storage
address.  I'm quite new to LLVM, so I would appreciate some help
understanding what I'm seeing from the dumps.  I assume that some
information is missing that would represent the memory dependency, but I
don't know what form that should take.
Example source code is as follows:
----------------------------------------------------------------
extern "C" { int printf(const char *, ...); void exit(int);}
struct foo {
  short i:8;
};
void check(struct foo f, short i) __attribute__((noinline)) {
  if (f.i != i) {
    short fi = f.i;
    printf("problem with %u != %u\n", fi, i);
    exit(0);
  }
}
---------------------------------------------------------------
The initial portion of the Clang output is:
define void @_Z5check3foos(%struct.foo* nocapture byval %f, i16 signext %i) noinline {
entry:
  %0 = bitcast %struct.foo* %f to i16*
  %1 = load i16* %0, align 2
  ...
---------------------------------------------------------------
The code works OK at -O0.  At -O1, the first part of the generated code
is:
---------------------------------------------------------------
.L._Z5check3foos:
	.cfi_startproc
# BB#0:                                 # %entry
	mflr 0
	std 0, 16(1)
	stdu 1, -112(1)
.Ltmp1:
	.cfi_def_cfa_offset 112
.Ltmp2:
	.cfi_offset lr, 16
	lha 5, 162(1)
	sth 3, 162(1)
        ...
---------------------------------------------------------------
The problem here is that the incoming parameter in register 3 is stored
too late, after an attempt to load the value into register 5.
Looking at dumps with -debug, I see the following:
---------------------------------------------------------------
********** MACHINEINSTRS **********
# Machine code for function _Z5check3foos: Post SSA
Frame Objects:
  fi#-1: size=2, align=2, fixed, at location [SP+50]
Function Live Ins: %X3 in %vreg1, %X4 in %vreg2
0B	BB#0: derived from LLVM BB %entry
	    Live Ins: %X3 %X4
16B		%vreg2<def> = COPY %X4; G8RC_with_sub_32:%vreg2
32B		%vreg1<def> = COPY %X3; G8RC:%vreg1
48B		STH8 %vreg1<kill>, 0, <fi#-1>; mem:ST2[FixedStack-1] G8RC:%vreg1
64B		%vreg4<def> = LHA 0, <fi#-1>; mem:LD2[%0] GPRC:%vreg4
                ...
---------------------------------------------------------------
So far, so good.  When we get to list scheduling, not quite so good:
---------------------------------------------------------------
********** List Scheduling **********
SU(0):   STH8 %X3<kill>, 162, %X1; mem:ST2[FixedStack-1]
  # preds left       : 0
  # succs left       : 4
  # rdefs left       : 0
  Latency            : 3
  Depth              : 0
  Height             : 0
  Successors:
   antiSU(2): Latency=0
   antiSU(2): Latency=0
   ch  SU(5): Latency=0
   ch  SU(4294967295) *: Latency=0
SU(1):   %R5<def> = LHA 162, %X1; mem:LD2[%0]
  # preds left       : 0
  # succs left       : 3
  # rdefs left       : 0
  Latency            : 5
  Depth              : 0
  Height             : 0
  Successors:
   out SU(3): Latency=1
   val SU(2): Latency=5
   ch  SU(5): Latency=0
...
---------------------------------------------------------------
There is no dependency expressed between these two memory operations,
although they both access the stack address 162(X1).  The scheduler then
sees both instructions as ready, and chooses the load based on critical
path height:
---------------------------------------------------------------
*** Examining Available
Height 9: SU(1):   %R5<def> = LHA 162, %X1; mem:LD2[%0]
Height 4: SU(0):   STH8 %X3<kill>, 162, %X1; mem:ST2[FixedStack-1]
*** Scheduling [0]: SU(1):   %R5<def> = LHA 162, %X1; mem:LD2[%0]
---------------------------------------------------------------
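The selection step shown in this dump can be modeled with a small sketch. This is a toy illustration, not LLVM's scheduler: ToyUnit, pickByHeight and makeUnits are hypothetical names, and the heights 9 and 4 are simply copied from the dump above (they would differ if an extra edge existed). It shows only that a store-to-load chain edge would keep the load out of the ready set, whereas without the edge the load wins on height.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy scheduling unit: not LLVM's SUnit, only the fields needed here.
struct ToyUnit {
  std::string name;
  unsigned height;    // critical-path height, as printed in the dump
  unsigned predsLeft; // unscheduled predecessors; 0 means "ready"
};

// Return the index of the ready unit with the greatest height.
// Ties go to the earlier unit. Requires at least one ready unit.
std::size_t pickByHeight(const std::vector<ToyUnit> &units) {
  std::size_t best = units.size();
  for (std::size_t i = 0; i < units.size(); ++i) {
    if (units[i].predsLeft != 0)
      continue; // still waiting on a predecessor
    if (best == units.size() || units[i].height > units[best].height)
      best = i;
  }
  assert(best != units.size() && "no ready unit");
  return best;
}

// Build the two units from the dump. storeToLoadEdge models the missing
// chain edge SU(0) -> SU(1) by giving the load one pending predecessor.
std::vector<ToyUnit> makeUnits(bool storeToLoadEdge) {
  return {
      {"SU(0) STH8", 4, 0},
      {"SU(1) LHA", 9, storeToLoadEdge ? 1u : 0u},
  };
}
```

With makeUnits(false) the pick is index 1 (the load), matching the dump; with makeUnits(true) only the store is ready, so index 0 is picked.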
The obvious questions are:  Why is there no dependence between these two
instructions?  And what needs to be done to ensure there is one?  My
guess is that we somehow need to unify FixedStack-1 with %0, but it's
not clear to me how this would be accomplished.
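As a cross-check that the two operands name the same slot, the displacement 162 is consistent with the frame object at [SP+50] once the 112-byte frame allocated by "stdu 1, -112(1)" is added. The sketch below just spells out that arithmetic; it assumes [SP+50] is relative to the stack pointer on entry to the function, and the constant and function names are made up for illustration.

```cpp
#include <cassert>

// Values read from the dumps in this message.
constexpr int kEntryOffset = 50; // fi#-1: "at location [SP+50]"
constexpr int kFrameSize = 112;  // stdu 1, -112(1) / .cfi_def_cfa_offset 112

// Offset of the same slot from the stack pointer after the prologue has
// moved the stack pointer down by frameSize bytes (assumed interpretation).
constexpr int offsetAfterPrologue(int entryOffset, int frameSize) {
  return entryOffset + frameSize;
}

// Matches the displacement in "sth 3, 162(1)" and "lha 5, 162(1)".
static_assert(offsetAfterPrologue(kEntryOffset, kFrameSize) == 162,
              "slot offset should match the generated code");
```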
(The store is generated as part of SelectionDAGISel::LowerArguments from
lib/CodeGen/SelectionDAG/SelectionDAGBuilder, using the PowerPC-specific
code in lib/Target/PowerPC/PPCISelLowering.cpp.  The load is generated
directly from the "load" in the LLVM IR at some other time.)
Thanks very much for any help!
Bill
-- 
Bill Schmidt, Ph.D.
IBM Advance Toolchain for PowerLinux
IBM Linux Technology Center
wschmidt at us.ibm.com
wschmidt at linux.vnet.ibm.com
William J. Schmidt
2012-Sep-21  14:07 UTC
[LLVMdev] Scheduling question (memory dependency)
Here's another data point that may be useful.  [Scheduling experts,
please help! :) ]
If the two-byte bitfield is replaced by a two-byte struct (replace
"short i:8" with "short i", etc.), the scheduler properly generates a
dependency between the store and the load.  For this case, a GEP is used
instead of a bitcast:
------------------------------------------------------------------
define void @_Z5check3fooj(%struct.foo* nocapture byval %f, i32 %i) noinline {
entry:
  %i1 = getelementptr inbounds %struct.foo* %f, i64 0, i32 0
  %0 = load i16* %i1, align 2, !tbaa !0
------------------------------------------------------------------
One notable difference is the "!tbaa !0" decoration on the load.  I
don't know whether this helps or not.  Later the lowered instructions
look like:
------------------------------------------------------------------
16B		%vreg2<def> = COPY %X4; G8RC_with_sub_32:%vreg2
32B		%vreg1<def> = COPY %X3; G8RC:%vreg1
48B		STH8 %vreg1<kill>, 0, <fi#-1>; mem:ST2[FixedStack-1] G8RC:%vreg1
64B		%vreg0<def> = LHZ 0, <fi#-1>; mem:LD2[%i11] GPRC:%vreg0
                ...
------------------------------------------------------------------
Note the %i11 instead of %0 on the LHZ as another difference.  The
scheduler then generates a dependency between the store and the load,
and everything works properly.
Does this help tickle any memories?
Thanks,
Bill
Hi Bill,
Which scheduler do you use, the MI one or the SDNode one? In either case
the problem is likely the same, but the cause might be in a different
place.
The way I see it, you have an issue with the alias analyzer, not the
scheduler. When the scheduling DAG is constructed, AA is checked for
pairs of memory-accessing objects, and if AA flags no potential
interference, the chain edge is _not_ inserted. If that decision is
wrong, you end up with well-hidden, randomly appearing bugs.
So the question is much more likely: why does AA see these two objects
as not aliasing, and are they properly described and presented to it?
Does the ld/bitcast have proper memory operands? Are there any flags on
them? Does the underlying memory object make sense?
You can look at getUnderlyingObjectForInstr and MIsNeedChainEdge in the
MI scheduling framework to see what I mean. If you are still using the
SDNode scheduling framework, it has very similar functionality in
slightly different code.
Hope this helps.
Sergei
---
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
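The chain-edge decision Sergei describes can be modeled roughly as follows. This is a toy sketch, not LLVM code: ToyMemOperand, ToyAlias, needsChainEdge and both oracle functions are hypothetical names invented for illustration. The "optimistic" oracle is only one assumed way a wrong no-alias answer could arise; the thread does not establish that this is the actual cause.

```cpp
#include <cassert>
#include <string>

// Toy memory operand; the object name mirrors what follows "mem:" in the
// dumps, e.g. "FixedStack-1" or "%0".
struct ToyMemOperand {
  std::string object;
  bool isStore;
};

enum class ToyAlias { NoAlias, MayAlias };

// Signature of a hypothetical alias oracle.
using ToyAliasQuery = ToyAlias (*)(const ToyMemOperand &,
                                   const ToyMemOperand &);

// Assumed failure mode: differently named objects are declared disjoint.
ToyAlias optimisticQuery(const ToyMemOperand &a, const ToyMemOperand &b) {
  return a.object == b.object ? ToyAlias::MayAlias : ToyAlias::NoAlias;
}

// Conservative oracle: it has no proof of disjointness, so it never
// answers NoAlias.
ToyAlias conservativeQuery(const ToyMemOperand &, const ToyMemOperand &) {
  return ToyAlias::MayAlias;
}

// In this toy model a chain edge is needed when at least one access is a
// store and the oracle cannot rule out overlap.
bool needsChainEdge(const ToyMemOperand &a, const ToyMemOperand &b,
                    ToyAliasQuery query) {
  assert(query != nullptr && "an alias oracle is required");
  if (!a.isStore && !b.isStore)
    return false; // two plain loads are not ordered against each other
  return query(a, b) != ToyAlias::NoAlias;
}
```

For a store to "FixedStack-1" and a load from "%0", the optimistic oracle yields no edge (the behavior seen in the bitfield case), while the conservative oracle yields an edge.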