Hal Finkel
2011-Dec-20 06:53 UTC
[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)
On Mon, 2011-12-19 at 22:14 -0800, Andrew Trick wrote:
> On Dec 19, 2011, at 3:19 PM, Hal Finkel wrote:
>
> > On Mon, 2011-12-19 at 07:41 -0800, Andrew Trick wrote:
> >> On Dec 19, 2011, at 6:51 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> >>
> >>> On Tue, 2011-10-25 at 21:00 -0700, Andrew Trick wrote:
> >>>> Now, to generate the best PPC schedules, there is one thing you may
> >>>> want to override. The scheduler's priority function has a
> >>>> HasReadyFilter attribute (enum). It can be overridden by specializing
> >>>> hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
> >>>> scheduling, and maximizes the instructions that can issue in one
> >>>> group, regardless of register pressure. We still care about register
> >>>> pressure enough in ARM to avoid enabling this. I'm really not sure how
> >>>> much it will help on modern PPC implementations though.
> >>>
> >>> Can this be done without modifying common code? It looks like
> >>> hybrid_ls_rr_sort is local to ScheduleDAGRRList.cpp.
> >>>
> >>> Thanks again,
> >>> Hal
> >>
> >> Right. You would need to specialize the priority queue logic. A small
> >> amount of common code.
> >> Andy
> >
> > Andy,
> >
> > I played around with this some today for my PPC 440 chips. These are
> > embedded chips (multiple pipelines but in-order), and may be more
> > similar to your ARMs than to the PPC-970 style designs...
> >
> > I was able to get reasonable PPC 440 code generation by using the ILP
> > scheduler pre-RA and then the post-RA scheduler with ANTIDEP_ALL (and my
> > load/store reordering patch). This worked significantly better than
> > using either hybrid or ilp alone (with or without setting
> > HasReadyFilter). I was looking at my primary use case, which is
> > partially-unrolled loops with loads, stores and floating-point
> > calculations.
> >
> > This seems to work because ILP first groups the instructions to extract
> > parallelism and then the post-RA scheduler breaks up the groups to avoid
> > stalls. This allows the scheduler to find its way out of what seems to
> > be a "local minimum" of sorts, whereby it wants to schedule each
> > unrolled iteration of the loop sequentially. The reason why this seems
> > to occur is that the hybrid scheduler would prefer to suffer a large
> > data-dependency delay over a shorter full-pipeline delay. Do you know
> > why it would do this? (You can see PR11589 for an example if you'd
> > like.)
>
> The "ilp" scheduler has several heuristics designed to compensate for
> the lack of an itinerary. Each of those heuristics has a flag, so you
> can see what works for your target. I've never used that scheduler with
> an itinerary, but it should work. It's just that some of the heuristics
> effectively override the hazard checker.
>
> The "hybrid" scheduler depends more on the itinerary/hazard checker.
> It's less likely to schedule instructions close together if they may
> induce a pipeline stall, regardless of operand latency.

I'd prefer to have a scheduler that just does what I want :) How can I
make a modified version of the hybrid scheduler that will weight operand
latency and pipeline stalls more equally?

Here's my "thought experiment" (from PR11589): I have a bunch of
load-fadd-store chains to schedule. A store takes two cycles to clear
its last pipeline stage. The fadd takes longer to compute its result
(say 5 cycles), but can sustain a rate of 1 independent add per cycle.
As the scheduling is bottom-up, it will schedule a store, then it has a
choice: it can schedule another store (at a 1-cycle penalty), or it can
schedule the fadd associated with the store it just scheduled (with a
4-cycle penalty due to operand latency). It seems that the current
hybrid scheduler will choose the fadd; I want a scheduler that will make
the opposite choice.

> > Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
> > used? Is there a reason that this is a compile-time constant? Both
> > Hybrid and ILP have isReady() functions. I can certainly propose a
> > patch to make them command-line options.
>
> It's a compile-time constant because it's clearly on the scheduler's
> critical path and not used by any active targets. Enabling
> HasReadyFilter turns the preRA scheduler into a strict scheduler such
> that the hazard checker overrides all other heuristics. That's not what
> you want if you're also enabling postRA scheduling!

Indeed, that makes sense.

Thanks again,
Hal

> -Andy

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
Andrew Trick
2011-Dec-20 07:20 UTC
[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)
On Dec 19, 2011, at 10:53 PM, Hal Finkel wrote:
> Here's my "thought experiment" (from PR11589): I have a bunch of
> load-fadd-store chains to schedule. A store takes two cycles to clear
> its last pipeline stage. The fadd takes longer to compute its result
> (say 5 cycles), but can sustain a rate of 1 independent add per cycle.
> As the scheduling is bottom-up, it will schedule a store, then it has a
> choice: it can schedule another store (at a 1 cycle penalty), or it can
> schedule the fadd associated with the store it just scheduled (with a 4
> cycle penalty due to operand latency). It seems that the current hybrid
> scheduler will choose the fadd, I want a scheduler that will make the
> opposite choice.

That's just wrong. You may need to look at -debug-only=pre-RA-sched and
debug your itinerary.

-Andy
Andrew Trick
2011-Dec-20 16:21 UTC
[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)
On Dec 19, 2011, at 11:20 PM, Andrew Trick wrote:
> On Dec 19, 2011, at 10:53 PM, Hal Finkel wrote:
>
>> Here's my "thought experiment" (from PR11589): I have a bunch of
>> load-fadd-store chains to schedule. A store takes two cycles to clear
>> its last pipeline stage. The fadd takes longer to compute its result
>> (say 5 cycles), but can sustain a rate of 1 independent add per cycle.
>> As the scheduling is bottom-up, it will schedule a store, then it has a
>> choice: it can schedule another store (at a 1 cycle penalty), or it can
>> schedule the fadd associated with the store it just scheduled (with a 4
>> cycle penalty due to operand latency). It seems that the current hybrid
>> scheduler will choose the fadd, I want a scheduler that will make the
>> opposite choice.
>
> That's just wrong. You may need to look at -debug-only=pre-RA-sched and
> debug your itinerary.
>
> -Andy

Wait… if you're using "sched=ilp" then register pressure avoidance is the
primary heuristic. So, yes, it will schedule each chain individually as
long as the critical path is not *too* long. sched=hybrid will only
behave like this when it determines that register pressure is high. You
have to make sure that the register pressure limit is implemented
properly for your target. debug-only=pre-RA-sched will show you the
number of locally live registers out of the total available each time an
instruction is scheduled.

-Andy
Hal Finkel
2011-Dec-20 16:35 UTC
[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)
On Mon, 2011-12-19 at 23:20 -0800, Andrew Trick wrote:
> On Dec 19, 2011, at 10:53 PM, Hal Finkel wrote:
>
> > Here's my "thought experiment" (from PR11589): I have a bunch of
> > load-fadd-store chains to schedule. A store takes two cycles to clear
> > its last pipeline stage. The fadd takes longer to compute its result
> > (say 5 cycles), but can sustain a rate of 1 independent add per cycle.
> > As the scheduling is bottom-up, it will schedule a store, then it has
> > a choice: it can schedule another store (at a 1 cycle penalty), or it
> > can schedule the fadd associated with the store it just scheduled
> > (with a 4 cycle penalty due to operand latency). It seems that the
> > current hybrid scheduler will choose the fadd, I want a scheduler
> > that will make the opposite choice.
>
> That's just wrong. You may need to look at -debug-only=pre-RA-sched
> and debug your itinerary.

Andy,

I've already looked at the debug output quite a bit; please help me
understand what I'm missing...

First, looking at the code does seem to confirm my suspicion. This
certainly is low-pressure mode, and so hybrid_ls_rr_sort::operator()
will return the result of BUCompareLatency. That function first checks
for stalls and returns 1 or -1. Only after that does it look at the
relative latencies. In addition, the stall computation is done using
BUHasStall, and that function only checks the current cycle. Without
looking forward, I don't understand how it could know how long the
pipeline hazard will last. It looks like this may have something to do
with the height. Can you explain how that is supposed to work?

For the specific example: We start with the initial store...
GPRC: 4 / 31 F4RC: 1 / 31
Examining Available:
Height 2: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910, 0x2c03870, 0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94] [ID=102]
Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60, 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160, 0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82] [ID=92]
Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550, 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910, 0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")> [ORD=70] [ID=85]

*** Scheduling [21]: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910, 0x2c03870, 0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94] [ID=102]

then it schedules a "token factor" that is attached to the address
computation required by the store (this is essentially a no-op,
right?)...

GPRC: 5 / 31 F4RC: 2 / 31
Examining Available:
Height 21: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1, 0x2c03a70 [ORD=94] [ID=5]
Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]
Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60, 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160, 0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82] [ID=92]
Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550, 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910, 0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")> [ORD=70] [ID=85]

*** Scheduling [21]: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1, 0x2c03a70 [ORD=94] [ID=5]

now here is the choice that we may want to be different...

GPRC: 5 / 31 F4RC: 2 / 31
Examining Available:
Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]
Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60, 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160, 0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82] [ID=92]
Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550, 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910, 0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")> [ORD=70] [ID=85]

(with more debug turned on, I also see a bunch of messages like:

*** Hazard in cycle 3, SU(97): xxx: ch = STFSX ...<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]

one of these for each of the other possible stores).

*** Scheduling [24]: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]

Why did it choose this fadd over any of the other stores? The
corresponding unit descriptions are:

SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910, 0x2c03870, 0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94] [ID=102]
  # preds left  : 4
  # succs left  : 1
  # rdefs left  : 0
  Latency       : 7
  Depth         : 0
  Height        : 0
  Predecessors:
    val #0x2c11ff0 - SU(105): Latency=3
    val #0x2c0cdd0 - SU(32): Latency=1
    val #0x2c11db0 - SU(103): Latency=1
    ch  #0x2c0af70 - SU(5): Latency=0
  Successors:
    ch  #0x2c0ac10 - SU(2): Latency=1

SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]
  # preds left  : 2
  # succs left  : 1
  # rdefs left  : 1
  Latency       : 11
  Depth         : 0
  Height        : 0
  Predecessors:
    val #0x2c12110 - SU(106): Latency=6
    val #0x2c0d130 - SU(35): Latency=6
  Successors:
    val #0x2c11c90 - SU(102): Latency=3

Just from the debugging messages, it looks like what is happening is
that the scheduler is first rejecting the other stores because of
pipeline hazards and then picking the instruction with the lowest
latency. Looking at the code, it seems that this is exactly what it was
designed to do. If I'm wrong about that, please explain.

Thanks in advance,
Hal

> -Andy

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory