thr3ads.net - llvm dev - [LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?) [Dec 2011]

If this information is useful, please help other people find it:
Share via:

Hal Finkel

2011-Dec-20 16:35 UTC

[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)

On Mon, 2011-12-19 at 23:20 -0800, Andrew Trick wrote:> 
> On Dec 19, 2011, at 10:53 PM, Hal Finkel wrote:
> 
> > Here's my "thought experiment" (from PR11589): I have a
bunch of
> > load-fadd-store chains to schedule. A store takes two cycles to
> > clear
> > its last pipeline stage. The fadd takes longer to compute its result
> > (say 5 cycles), but can sustain a rate of 1 independent add per
> > cycle.
> > As the scheduling is bottom-up, it will schedule a store, then it
> > has a
> > choice: it can schedule another store (at a 1 cycle penalty), or it
> > can
> > schedule the fadd associated with the store it just scheduled (with
> > a 4
> > cycle penalty due to operand latency). It seems that the current
> > hybrid
> > scheduler will choose the fadd, I want a scheduler that will make
> > the
> > opposite choice.
> 
> That's just wrong. You may need to look at -debug-only=pre-RA-sched
> and debug your itinerary.
Andy, I've already looked at the debug output quite a bit; please help
me understand what I'm missing...

First, looking at the code does seem to confirm my suspicion. This is
certainly is low-pressure mode, and so hybrid_ls_rr_sort::operator()
will return the result of BUCompareLatency. That function first checks
for stalls and returns 1 or -1. Only after that does it look at the
relative latencies.

In addition, the stall computation is done using BUHasStall, and that
function only checks the current cycle. Without looking forward, I don't
understand how it could know how long the pipeline hazard will last.

It looks like this may have something to do with the height. Can you
explain how that is supposed to work?

For the specific example: We start with the initial store...

GPRC: 4 / 31
F4RC: 1 / 31

Examining Available:
Height 2: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910,
0x2c03870,
0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")>
[ORD=94] [ID=102]

Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88]
[ID=97]

Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")>
[ORD=82]
[ID=92]

Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76]
[ID=90]

Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
0x2c00940,
0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
[ORD=70] [ID=85]

*** Scheduling [21]: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70,
0x2bf3910, 0x2c03870, 0x2c03e70<Mem:ST4[%
arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94] [ID=102]

then it schedules a "token factor" that is attached to the address
computation required by the store (this is essentially a no-op,
right?)...

GPRC: 5 / 31
F4RC: 2 / 31

Examining Available:
Height 21: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1, 0x2c03a70
[ORD=94] [ID=5]

Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
[ID=105]

Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88]
[ID=97]

Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")>
[ORD=82]
[ID=92]

Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76]
[ID=90]

Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
0x2c00940,
0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
[ORD=70] [ID=85]

*** Scheduling [21]: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1,
0x2c03a70 [ORD=94] [ID=5]

how here is the choice that we may want to be different...

GPRC: 5 / 31
F4RC: 2 / 31

Examining Available:
Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
[ID=105]

Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88]
[ID=97]

Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")>
[ORD=82]
[ID=92]

Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76]
[ID=90]

Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
0x2c00940,
0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
[ORD=70] [ID=85]

(with more debug turned on, I also see a bunch of messages like:
*** Hazard in cycle 3, SU(97): xxx: ch = STFSX ...<Mem:ST4[%
arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
one of these for each of the other possible stores).

*** Scheduling [24]: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70,
0x2bf3710 [ORD=92] [ID=105]

why did it choose this fadd over any of the other stores? the
corresponding unit descriptions are:

SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910, 0x2c03870,
0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")>
[ORD=94]
[ID=102]

  # preds left       : 4
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 7
  Depth              : 0
  Height             : 0
  Predecessors:
   val #0x2c11ff0 - SU(105): Latency=3
   val #0x2c0cdd0 - SU(32): Latency=1
   val #0x2c11db0 - SU(103): Latency=1
   ch  #0x2c0af70 - SU(5): Latency=0
  Successors:
   ch  #0x2c0ac10 - SU(2): Latency=1

SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]

  # preds left       : 2
  # succs left       : 1
  # rdefs left       : 1
  Latency            : 11
  Depth              : 0
  Height             : 0
  Predecessors:
   val #0x2c12110 - SU(106): Latency=6
   val #0x2c0d130 - SU(35): Latency=6
  Successors:
   val #0x2c11c90 - SU(102): Latency=3

Just from the debugging messages, it looks like what is happening is
that the scheduler is first rejecting the other stores because of
pipeline hazards and then picking the instruction with the lowest
latency. Looking at the code, it seems that this is exactly what it was
designed to do. If I'm wrong about that, please explain.

Thanks in advance,
Hal
> 
> 
> -Andy
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel

2011-Dec-20 18:29 UTC

head link

[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)

On Tue, 2011-12-20 at 10:35 -0600, Hal Finkel wrote:> On Mon, 2011-12-19 at 23:20 -0800, Andrew Trick wrote:
> > 
> > On Dec 19, 2011, at 10:53 PM, Hal Finkel wrote:
> > 
> > > Here's my "thought experiment" (from PR11589): I
have a bunch of
> > > load-fadd-store chains to schedule. A store takes two cycles to
> > > clear
> > > its last pipeline stage. The fadd takes longer to compute its
result
> > > (say 5 cycles), but can sustain a rate of 1 independent add per
> > > cycle.
> > > As the scheduling is bottom-up, it will schedule a store, then it
> > > has a
> > > choice: it can schedule another store (at a 1 cycle penalty), or
it
> > > can
> > > schedule the fadd associated with the store it just scheduled
(with
> > > a 4
> > > cycle penalty due to operand latency). It seems that the current
> > > hybrid
> > > scheduler will choose the fadd, I want a scheduler that will make
> > > the
> > > opposite choice.
> > 
> > That's just wrong. You may need to look at
-debug-only=pre-RA-sched
> > and debug your itinerary.
> 
> Andy, I've already looked at the debug output quite a bit; please help
> me understand what I'm missing...
> 
> First, looking at the code does seem to confirm my suspicion. This is
> certainly is low-pressure mode, and so hybrid_ls_rr_sort::operator()
> will return the result of BUCompareLatency. That function first checks
> for stalls and returns 1 or -1. Only after that does it look at the
> relative latencies.
Looking at this more carefully, I think that I see the problem. The
heights are set to account for the latencies:
PredSU->setHeightToAtLeast(SU->getHeight() + PredEdge->getLatency());

but the latencies are considered only if the node as an ILP scheduling
preference (the default in TargetLowering.h is None):
  bool LStall = (!checkPref || left->SchedulingPref == Sched::ILP) &&
    BUHasStall(left, LHeight, SPQ);
...

and the PPC backend does not override getSchedulingPreference.

 -Hal
> 
> In addition, the stall computation is done using BUHasStall, and that
> function only checks the current cycle. Without looking forward, I
don't
> understand how it could know how long the pipeline hazard will last.
> 
> It looks like this may have something to do with the height. Can you
> explain how that is supposed to work?
> 
> For the specific example: We start with the initial store...
> 
> GPRC: 4 / 31
> F4RC: 1 / 31
> 
> Examining Available:
> Height 2: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910,
> 0x2c03870,
0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")>
> [ORD=94] [ID=102]
> 
> Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
> 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88]
[ID=97]
> 
> Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
> 0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")>
[ORD=82]
> [ID=92]
> 
> Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
> 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76]
[ID=90]
> 
> Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
> 0x2c00940,
0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
> [ORD=70] [ID=85]
> 
> *** Scheduling [21]: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70,
> 0x2bf3910, 0x2c03870, 0x2c03e70<Mem:ST4[%
> arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94] [ID=102]
> 
> then it schedules a "token factor" that is attached to the
address
> computation required by the store (this is essentially a no-op,
> right?)...
> 
> GPRC: 5 / 31
> F4RC: 2 / 31
> 
> Examining Available:
> Height 21: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1, 0x2c03a70
> [ORD=94] [ID=5]
> 
> Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
> [ID=105]
> 
> Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
> 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88]
[ID=97]
> 
> Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
> 0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")>
[ORD=82]
> [ID=92]
> 
> Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
> 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76]
[ID=90]
> 
> Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
> 0x2c00940,
0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
> [ORD=70] [ID=85]
> 
> *** Scheduling [21]: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1,
> 0x2c03a70 [ORD=94] [ID=5]
> 
> how here is the choice that we may want to be different...
> 
> GPRC: 5 / 31
> F4RC: 2 / 31
> 
> Examining Available:
> Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
> [ID=105]
> 
> Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
> 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88]
[ID=97]
> 
> Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
> 0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")>
[ORD=82]
> [ID=92]
> 
> Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
> 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76]
[ID=90]
> 
> Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
> 0x2c00940,
0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
> [ORD=70] [ID=85]
> 
> (with more debug turned on, I also see a bunch of messages like:
> *** Hazard in cycle 3, SU(97): xxx: ch = STFSX ...<Mem:ST4[%
> arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
> one of these for each of the other possible stores).
> 
> *** Scheduling [24]: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70,
> 0x2bf3710 [ORD=92] [ID=105]
> 
> why did it choose this fadd over any of the other stores? the
> corresponding unit descriptions are:
> 
> SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910, 0x2c03870,
> 0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")>
[ORD=94]
> [ID=102]
> 
>   # preds left       : 4
>   # succs left       : 1
>   # rdefs left       : 0
>   Latency            : 7
>   Depth              : 0
>   Height             : 0
>   Predecessors:
>    val #0x2c11ff0 - SU(105): Latency=3
>    val #0x2c0cdd0 - SU(32): Latency=1
>    val #0x2c11db0 - SU(103): Latency=1
>    ch  #0x2c0af70 - SU(5): Latency=0
>   Successors:
>    ch  #0x2c0ac10 - SU(2): Latency=1
> 
> SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]
> 
>   # preds left       : 2
>   # succs left       : 1
>   # rdefs left       : 1
>   Latency            : 11
>   Depth              : 0
>   Height             : 0
>   Predecessors:
>    val #0x2c12110 - SU(106): Latency=6
>    val #0x2c0d130 - SU(35): Latency=6
>   Successors:
>    val #0x2c11c90 - SU(102): Latency=3
> 
> Just from the debugging messages, it looks like what is happening is
> that the scheduler is first rejecting the other stores because of
> pipeline hazards and then picking the instruction with the lowest
> latency. Looking at the code, it seems that this is exactly what it was
> designed to do. If I'm wrong about that, please explain.
> 
> Thanks in advance,
> Hal
> 
> > 
> > 
> > -Andy
> 
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Andrew Trick

2011-Dec-20 20:18 UTC

head link

[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)

On Dec 20, 2011, at 10:29 AM, Hal Finkel wrote:
> On Tue, 2011-12-20 at 10:35 -0600, Hal Finkel wrote:
>> On Mon, 2011-12-19 at 23:20 -0800, Andrew Trick wrote:
>>> 
>>> On Dec 19, 2011, at 10:53 PM, Hal Finkel wrote:
>>> 
>>>> Here's my "thought experiment" (from PR11589): I
have a bunch of
>>>> load-fadd-store chains to schedule. A store takes two cycles to
>>>> clear
>>>> its last pipeline stage. The fadd takes longer to compute its
result
>>>> (say 5 cycles), but can sustain a rate of 1 independent add per
>>>> cycle.
>>>> As the scheduling is bottom-up, it will schedule a store, then
it
>>>> has a
>>>> choice: it can schedule another store (at a 1 cycle penalty),
or it
>>>> can
>>>> schedule the fadd associated with the store it just scheduled
(with
>>>> a 4
>>>> cycle penalty due to operand latency). It seems that the
current
>>>> hybrid
>>>> scheduler will choose the fadd, I want a scheduler that will
make
>>>> the
>>>> opposite choice.
>>> 
>>> That's just wrong. You may need to look at
-debug-only=pre-RA-sched
>>> and debug your itinerary.
>> 
>> Andy, I've already looked at the debug output quite a bit; please
help
>> me understand what I'm missing...
>> 
>> First, looking at the code does seem to confirm my suspicion. This is
>> certainly is low-pressure mode, and so hybrid_ls_rr_sort::operator()
>> will return the result of BUCompareLatency. That function first checks
>> for stalls and returns 1 or -1. Only after that does it look at the
>> relative latencies.
> 
> Looking at this more carefully, I think that I see the problem. The
> heights are set to account for the latencies:
> PredSU->setHeightToAtLeast(SU->getHeight() +
PredEdge->getLatency());
> 
> but the latencies are considered only if the node as an ILP scheduling
> preference (the default in TargetLowering.h is None):
>  bool LStall = (!checkPref || left->SchedulingPref == Sched::ILP)
&&
>    BUHasStall(left, LHeight, SPQ);
> ...
> 
> and the PPC backend does not override getSchedulingPreference.
> 
Right, even with sched=hybrid, the scheduler will fall back to register pressure
scheduling unless the target implements
TargetLowering::getSchedulingPreference. I forgot that piece of the puzzle.

You could try simply returning Sched::ILP from
PPCTargetLowering::getSchedulingPreference. If you have later have regpressure
problems, you can so something more complicated like
ARMTargetLowering::getSchedulingPreference.

BTW - If you set HasReadyFilter, the fadd (105) would not even appear in the
queue until the scheduler reached cycle [24]. So three additional stores would
have been scheduled first. HasReadyFilter effectively treats operand latency
stalls as strictly as pipeline hazards. It's not clear to me that want to do
that though if you fix getSchedulingPreference and do postRA scheduling later
anyway.

So it should work to do "hybrid" scheduling biased toward ILP, vs.
"ilp" scheduling which really does the opposite of what it's name
implies because it's initially biased toward regpressure.

-Andy
>> 
>> In addition, the stall computation is done using BUHasStall, and that
>> function only checks the current cycle. Without looking forward, I
don't
>> understand how it could know how long the pipeline hazard will last.
>> 
>> It looks like this may have something to do with the height. Can you
>> explain how that is supposed to work?
>> 
>> For the specific example: We start with the initial store...
>> 
>> GPRC: 4 / 31
>> F4RC: 1 / 31
>> 
>> Examining Available:
>> Height 2: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910,
>> 0x2c03870,
0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")>
>> [ORD=94] [ID=102]
>> 
>> Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910,
0x2c02c60,
>> 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")>
[ORD=88] [ID=97]
>> 
>> Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910,
0x2c02160,
>>
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")>
[ORD=82]
>> [ID=92]
>> 
>> Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910,
0x2c01550,
>> 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")>
[ORD=76] [ID=90]
>> 
>> Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
>> 0x2c00940,
0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
>> [ORD=70] [ID=85]
>> 
>> *** Scheduling [21]: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70,
>> 0x2bf3910, 0x2c03870, 0x2c03e70<Mem:ST4[%
>> arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94] [ID=102]
>> 
>> then it schedules a "token factor" that is attached to the
address
>> computation required by the store (this is essentially a no-op,
>> right?)...
>> 
>> GPRC: 5 / 31
>> F4RC: 2 / 31
>> 
>> Examining Available:
>> Height 21: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1, 0x2c03a70
>> [ORD=94] [ID=5]
>> 
>> Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710
[ORD=92]
>> [ID=105]
>> 
>> Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910,
0x2c02c60,
>> 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")>
[ORD=88] [ID=97]
>> 
>> Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910,
0x2c02160,
>>
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")>
[ORD=82]
>> [ID=92]
>> 
>> Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910,
0x2c01550,
>> 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")>
[ORD=76] [ID=90]
>> 
>> Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
>> 0x2c00940,
0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
>> [ORD=70] [ID=85]
>> 
>> *** Scheduling [21]: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1,
>> 0x2c03a70 [ORD=94] [ID=5]
>> 
>> how here is the choice that we may want to be different...
>> 
>> GPRC: 5 / 31
>> F4RC: 2 / 31
>> 
>> Examining Available:
>> Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710
[ORD=92]
>> [ID=105]
>> 
>> Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910,
0x2c02c60,
>> 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")>
[ORD=88] [ID=97]
>> 
>> Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910,
0x2c02160,
>>
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")>
[ORD=82]
>> [ID=92]
>> 
>> Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910,
0x2c01550,
>> 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")>
[ORD=76] [ID=90]
>> 
>> Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
>> 0x2c00940,
0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
>> [ORD=70] [ID=85]
>> 
>> (with more debug turned on, I also see a bunch of messages like:
>> *** Hazard in cycle 3, SU(97): xxx: ch = STFSX ...<Mem:ST4[%
>> arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
>> one of these for each of the other possible stores).
>> 
>> *** Scheduling [24]: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70,
>> 0x2bf3710 [ORD=92] [ID=105]
>> 
>> why did it choose this fadd over any of the other stores? the
>> corresponding unit descriptions are:
>> 
>> SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910, 0x2c03870,
>>
0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")>
[ORD=94]
>> [ID=102]
>> 
>>  # preds left       : 4
>>  # succs left       : 1
>>  # rdefs left       : 0
>>  Latency            : 7
>>  Depth              : 0
>>  Height             : 0
>>  Predecessors:
>>   val #0x2c11ff0 - SU(105): Latency=3
>>   val #0x2c0cdd0 - SU(32): Latency=1
>>   val #0x2c11db0 - SU(103): Latency=1
>>   ch  #0x2c0af70 - SU(5): Latency=0
>>  Successors:
>>   ch  #0x2c0ac10 - SU(2): Latency=1
>> 
>> SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]
>> 
>>  # preds left       : 2
>>  # succs left       : 1
>>  # rdefs left       : 1
>>  Latency            : 11
>>  Depth              : 0
>>  Height             : 0
>>  Predecessors:
>>   val #0x2c12110 - SU(106): Latency=6
>>   val #0x2c0d130 - SU(35): Latency=6
>>  Successors:
>>   val #0x2c11c90 - SU(102): Latency=3
>> 
>> Just from the debugging messages, it looks like what is happening is
>> that the scheduler is first rejecting the other stores because of
>> pipeline hazards and then picking the instruction with the lowest
>> latency. Looking at the code, it seems that this is exactly what it was
>> designed to do. If I'm wrong about that, please explain.
>> 
>> Thanks in advance,
>> Hal\
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111220/5216cfd9/attachment.html>

llvm dev - Dec 2011 - [LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)

[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)

[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)

[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)

Apparently Analagous Threads