thr3ads.net - llvm dev - [LLVMdev] Question about load clustering in the machine scheduler [Mar 2015]


Tom Stellard

2015-Mar-27 02:36 UTC

[LLVMdev] Question about load clustering in the machine scheduler

Hi,

I have a program with over 100 loads (each with a 10 cycle latency)
at the beginning of the program, and I can't figure out how to get
the machine scheduler to intermix ALU instructions with the loads to
effectively hide the latency.

It seems the issue is with load clustering.  I restrict load clustering
to 4 at a time, but when I look at the debug output, the loads are
always being scheduled based on the fact that they are clustered, e.g.

Pick Top CLUSTER
Scheduling SU(10) %vreg13<def> = S_BUFFER_LOAD_DWORD_IMM %vreg9, 4;
mem:LD4[<unknown>] SGPR_32:%vreg13 SReg_128:%vreg9

I have a feeling there is something wrong with my machine model in the
R600 backend, but I've experimented with a few variations of it and have
been unable to solve this problem.  Does anyone have any idea what I
might be doing wrong?

Here are my resource definitions from lib/Target/R600/SISchedule.td

// BufferSize = 0 means the processors are in-order.
let BufferSize = 0 in {
  
// XXX: Are the resource counts correct?
def HWBranch : ProcResource<1>;  
def HWExport : ProcResource<7>;   // Taken from S_WAITCNT
def HWLGKM   : ProcResource<31>;  // Taken from S_WAITCNT
def HWSALU   : ProcResource<1>;  
def HWVMEM   : ProcResource<15>;  // Taken from S_WAITCNT
def HWVALU   : ProcResource<1>;
  
}

Thanks,
Tom

Andrew Trick

2015-Mar-27 06:50 UTC


[LLVMdev] Question about load clustering in the machine scheduler

> On Mar 26, 2015, at 7:36 PM, Tom Stellard <tom at stellard.net> wrote:
> 
> Hi,
> 
> I have a program with over 100 loads (each with a 10 cycle latency)
> at the beginning of the program, and I can't figure out how to get
> the machine scheduler to intermix ALU instructions with the loads to
> effectively hide the latency.
> 
> It seems the issue is with load clustering.  I restrict load clustering
> to 4 at a time, but when I look at the debug output, the loads are
> always being scheduled based on the fact that they are clustered, e.g.
> 
> Pick Top CLUSTER
> Scheduling SU(10) %vreg13<def> = S_BUFFER_LOAD_DWORD_IMM %vreg9, 4;
> mem:LD4[<unknown>] SGPR_32:%vreg13 SReg_128:%vreg9

Well, only 4 loads in a sequence should have the “cluster” edges. You should be
able to see that when the DAG is printed before scheduling.

Even without that limit, stalls take precedence over load clustering. So when
you run out of load resources (15?) the scheduler should choose something else.

> I have a feeling there is something wrong with my machine model in the
> R600 backend, but I've experimented with a few variations of it and have
> been unable to solve this problem.  Does anyone have any idea what I
> might be doing wrong?

Sorry, not without actually looking through the debug output. The output lists
the cycle time at each instruction, so you can see where the scheduler thinks
the stalls are.

BTW- I just checked in a small fix for in-order scheduling that might make
debugging this easier.

Andy

> Here are my resource definitions from lib/Target/R600/SISchedule.td
> 
> // BufferSize = 0 means the processors are in-order.
> let BufferSize = 0 in {
> 
> // XXX: Are the resource counts correct?
> def HWBranch : ProcResource<1>;  
> def HWExport : ProcResource<7>;   // Taken from S_WAITCNT
> def HWLGKM   : ProcResource<31>;  // Taken from S_WAITCNT
> def HWSALU   : ProcResource<1>;  
> def HWVMEM   : ProcResource<15>;  // Taken from S_WAITCNT
> def HWVALU   : ProcResource<1>;
> 
> }
> 
> Thanks,
> Tom
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Tom Stellard

2015-Mar-27 14:52 UTC


[LLVMdev] Question about load clustering in the machine scheduler

On Thu, Mar 26, 2015 at 11:50:20PM -0700, Andrew Trick wrote:
>
> > On Mar 26, 2015, at 7:36 PM, Tom Stellard <tom at stellard.net> wrote:
> > 
> > Hi,
> > 
> > I have a program with over 100 loads (each with a 10 cycle latency)
> > at the beginning of the program, and I can't figure out how to get
> > the machine scheduler to intermix ALU instructions with the loads to
> > effectively hide the latency.
> > 
> > It seems the issue is with load clustering.  I restrict load clustering
> > to 4 at a time, but when I look at the debug output, the loads are
> > always being scheduled based on the fact that they are clustered, e.g.
> > 
> > Pick Top CLUSTER
> > Scheduling SU(10) %vreg13<def> = S_BUFFER_LOAD_DWORD_IMM %vreg9, 4;
> > mem:LD4[<unknown>] SGPR_32:%vreg13 SReg_128:%vreg9
>
> Well, only 4 loads in a sequence should have the “cluster” edges. You should
> be able to see that when the DAG is printed before scheduling.
>

There are 4 consecutive 'Pick Top CLUSTER' then a 'Pick Top WEAK' and
then the pattern repeats itself.  All of these are loads.

> Even without that limit, stalls take precedence over load clustering. So when
> you run out of load resources (15?) the scheduler should choose something
> else.
> 
Is this the code that checks for stalls?

  if (tryLess(Zone.getLatencyStallCycles(TryCand.SU),
              Zone.getLatencyStallCycles(Cand.SU), TryCand, Cand, Stall))


The check is effectively disabled when !SU->isUnbuffered, i.e. for any unit
that does not use an unbuffered resource.
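
The interaction described above can be sketched as a small standalone model. This is a toy under stated assumptions, not LLVM's code: `ToyUnit`, `latencyStallCycles` and `pickBefore` are hypothetical names, and the only behavior modeled is that the stall count is forced to 0 for units that are not unbuffered, with cluster membership used as the next tie-break.

```cpp
// Hypothetical stand-in for a scheduling unit; not LLVM's SUnit.
struct ToyUnit {
  bool isUnbuffered;   // true if the unit uses a resource with BufferSize = 0
  unsigned readyCycle; // first cycle at which the unit can issue without stalling
  bool inCluster;      // true if a cluster edge ties it to the previous pick
};

// Cycles the unit would stall if issued at currCycle. Units that are not
// unbuffered always report 0, so the stall comparison cannot separate them.
unsigned latencyStallCycles(const ToyUnit &u, unsigned currCycle) {
  if (!u.isUnbuffered)
    return 0;
  return u.readyCycle > currCycle ? u.readyCycle - currCycle : 0;
}

// Returns true if `a` should be picked before `b`: fewer stall cycles wins
// first, and cluster membership only breaks ties between equal stall counts.
bool pickBefore(const ToyUnit &a, const ToyUnit &b, unsigned currCycle) {
  unsigned stallA = latencyStallCycles(a, currCycle);
  unsigned stallB = latencyStallCycles(b, currCycle);
  if (stallA != stallB)
    return stallA < stallB;
  return a.inCluster && !b.inCluster;
}
```

In this model, a clustered unit that would stall loses to a ready ALU instruction only when both are marked unbuffered. Once the stall count collapses to 0 for every candidate, the cluster tie-break decides, which would be consistent with the repeated 'Pick Top CLUSTER' lines in the debug output.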


> > I have a feeling there is something wrong with my machine model in the
> > R600 backend, but I've experimented with a few variations of it and have
> > been unable to solve this problem.  Does anyone have any idea what I
> > might be doing wrong?
>
> Sorry, not without actually looking through the debug output. The output
> lists the cycle time at each instruction, so you can see where the scheduler
> thinks the stalls are.
> 
There are actually 31 resources defined for loads.  However, there
aren't actually 31 load units in the hardware.  There is 1 load unit
that can hold up to 31 loads waiting to be executed, but only 1 load
can be executed at a time.
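
To make the difference concrete, here is a toy model of a resource with a given number of identical units. It is only a sketch: `ToyResource` and `lastIssueCycle` are hypothetical names, each load is assumed to occupy a unit for a fixed number of cycles, and the queue of pending loads is not modeled at all.

```cpp
#include <algorithm>
#include <vector>

// Toy in-order resource: `units` identical copies, each busy for
// `busyCycles` after accepting an operation. Hypothetical, not LLVM's model.
class ToyResource {
public:
  ToyResource(unsigned units, unsigned busyCycles)
      : freeAt_(units, 0), busyCycles_(busyCycles) {}

  // Issues one operation no earlier than `cycle` on the unit that frees up
  // first, and returns the cycle at which it actually issues.
  unsigned issue(unsigned cycle) {
    auto it = std::min_element(freeAt_.begin(), freeAt_.end());
    unsigned start = std::max(cycle, *it);
    *it = start + busyCycles_;
    return start;
  }

private:
  std::vector<unsigned> freeAt_; // per unit: first cycle at which it is free
  unsigned busyCycles_;
};

// Cycle at which the last of `loads` loads issues when all of them are
// requested at cycle 0. Assumes units >= 1.
unsigned lastIssueCycle(unsigned units, unsigned busyCycles, unsigned loads) {
  ToyResource resource(units, busyCycles);
  unsigned last = 0;
  for (unsigned i = 0; i < loads; ++i)
    last = resource.issue(0);
  return last;
}
```

With 31 units and one busy cycle per load, 31 loads all issue at cycle 0 in this model, so the resource never looks exhausted. With a single unit the 31st load issues at cycle 30, which is closer to the hardware behavior described above.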



Pick Top CLUSTER   
Scheduling SU(43) %vreg46<def> = S_BUFFER_LOAD_DWORD_IMM %vreg9, 48;
mem:LD4[<unknown>] SGPR_32:%vreg46 SReg_128:%vreg9
  SReg_32: 45 > 44(+ 0 livethru)
  VS_32: 51 > 18(+ 0 livethru)
  Ready @46c
  HWLGKM +1x105u
  TopQ.A BotLatency SU(43) 78c
  *** Max MOps 1 at cycle 46
Cycle: 47 TopQ.A
TopQ.A @47c
  Retired: 47
  Executed: 47c
  Critical: 47c, 47 MOps
  ExpectedLatency: 10c
  - Latency limited.
BotQ.A RemLatency SU(1698) 99c
  TopQ.A + Remain MOps: 1692
TopQ.A RemLatency SU(201) 97c
  BotQ.A + Remain MOps: 1647
BotQ.A: 1698 1694 1695


The above is an example of the debugging output.  Where is the cycle time
here?

> BTW- I just checked in a small fix for in-order scheduling that might make
> debugging this easier.
> 
I will take a look at this.

Thanks,
Tom
> Andy
> 
> > Here are my resource definitions from lib/Target/R600/SISchedule.td
> > 
> > // BufferSize = 0 means the processors are in-order.
> > let BufferSize = 0 in {
> > 
> > // XXX: Are the resource counts correct?
> > def HWBranch : ProcResource<1>;  
> > def HWExport : ProcResource<7>;   // Taken from S_WAITCNT
> > def HWLGKM   : ProcResource<31>;  // Taken from S_WAITCNT
> > def HWSALU   : ProcResource<1>;  
> > def HWVMEM   : ProcResource<15>;  // Taken from S_WAITCNT
> > def HWVALU   : ProcResource<1>;
> > 
> > }
> 
> > 
> > Thanks,
> > Tom
