Jonas Paulsson via llvm-dev
2017-Nov-23 10:53 UTC
[llvm-dev] mischeduler (pre-RA) experiments
Hi,

I have been experimenting for a while with the tryCandidate() method of the pre-RA mischeduler. I have by chance found some parameters that give quite good results on benchmarks on SystemZ (on average 1% improvement, some improvements of several percent and very few regressions). Basically, I add a "latency heuristic boost" just above the processor resources check:

    tryCandidate() {
      ...

      // Avoid increasing the max pressure of the entire region.
      if (DAG->isTrackingPressure() && tryPressure(TryCand.RPDelta.CurrentMax,
                                                   Cand.RPDelta.CurrentMax,
                                                   TryCand, Cand, RegMax, TRI,
                                                   DAG->MF))
        return;

      /// INSERTION POINT

      ...
    }

I had started to experiment with adding tryLatency() in various places, and found this to be the best spot for SystemZ/SPEC-2006. This gave noticeable improvements immediately that were too good to ignore, so I started figuring out things about the regressions that of course also showed up. Eventually, after many iterations, I came up with a combined heuristic that reads:

    if (((TryCand.Latency >= 7 && "Longest latency of any SU in DAG" < 15) ||
         "Number of SUnits in DAG" > 180) &&
        tryLatency(TryCand, Cand, *Zone))
      return;

In English: do tryLatency either if the latency of the candidate is >= 7 and the DAG has no really long-latency SUs (latency > 15), or alternatively always if the DAG is really big (> 180 SUnits).

I am now looking for opinions on what to do next with this.

pros:

- Clearly beneficial on benchmarks.

- All the register pressure heuristics have been run, so there *should* not be increased spilling.

- This mainly gives latency priority over resource balancing, which seems harmless if it shows to be beneficial.

- This gives higher ILP (register usage) and gives the SystemZ post-RA scheduler more freedom to improve decoder grouping etc.

cons:

- I am not sure if it is acceptable to have limits like this to control scheduling. Code generation can change in the years to come, and who knows if those limits are safe then...

On the other hand, as I have been rebasing just recently, results have varied a bit but stayed stably beneficial. I have also managed to remove the sharp limit and address this concern a bit by having a zone from 170-190 that makes the change more gradual as the DAG becomes "big". The values of 7 and 15 could as well be 6/8 or 30, so it's not really hyper-sensitive at the moment either, I'd say.

I don't know of any better way of getting these experimental results, but of course it would be nice to know more and be able to say "why" this works. That is indeed complex, given the highly out-of-order nature of SystemZ in addition to the regalloc effects etc.

Then there is the matter of implementation - could this become some kind of "latencyBoost" hook in the generic scheduler (after all, other targets might benefit also), or would SystemZ have to derive its own SchedStrategy (which isn't very nice if you just want to change one thing and still get future improvements of common code)?

I would appreciate any help and suggestions!

Jonas
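A minimal C++ sketch of how the quoted heuristic could look at the INSERTION POINT. The DAG-wide statistics (MaxDAGLatency, NumDAGSUnits) are hypothetical values assumed to be precomputed once per scheduling region - they are not existing GenericScheduler members - and TryCand.Latency above is read here as the candidate SUnit's Latency field; the thresholds are the experimental values from the message:

    // Assumed precomputed per region (e.g. in initialize()), since
    // rescanning DAG->SUnits on every tryCandidate() call would be slow:
    unsigned MaxDAGLatency = 0;
    for (const SUnit &SU : DAG->SUnits)
      MaxDAGLatency = std::max(MaxDAGLatency, unsigned(SU.Latency));
    unsigned NumDAGSUnits = DAG->SUnits.size();

    // At the INSERTION POINT, after the register-pressure checks:
    bool NoLongLatSUs = MaxDAGLatency < 15;  // no really long-latency SUs
    bool BigDAG = NumDAGSUnits > 180;        // sharp form of the 170-190 ramp
    if (((TryCand.SU->Latency >= 7 && NoLongLatSUs) || BigDAG) &&
        tryLatency(TryCand, Cand, *Zone))
      return;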
Andrew Trick via llvm-dev
2017-Nov-24 20:58 UTC
[llvm-dev] mischeduler (pre-RA) experiments
> On Nov 23, 2017, at 2:53 AM, Jonas Paulsson <paulsson at linux.vnet.ibm.com> wrote:
>
> [...]
>
> Then there is the matter of implementation - could this become some kind of "latencyBoost" hook in the generic scheduler (after all, other targets might benefit also), or would SystemZ have to derive its own SchedStrategy (which isn't very nice if you just want to change one thing and still get future improvements of common code)?
>
> I would appreciate any help and suggestions!
>
> Jonas

Your analysis makes sense and the approach seems reasonable.

Regarding the implementation... I try to avoid using hooks for heuristics. We do have some hooks for enabling/disabling features of the generic scheduling infrastructure.
Those are intended to make it easy to see which parts can be controlled independently, so backends can pick and choose scheduling functionality. My concern is that it becomes very difficult to modify the generic scheduler once it is laden with machine-specific thresholds. We want the generic scheduler to be unencumbered by such details so that it's easy to redesign/refactor as functionality evolves.

For heuristics that result from careful tuning, it's best for a backend to define its own. Other backends can easily copy that approach, making their own adjustments/additions without the need to reevaluate another platform's workloads.

Once heuristics have been thoroughly tuned for a microarchitecture, I think it should have its own scheduling strategy. I don't think you want other scheduling changes to affect that target until someone can reevaluate your benchmarks.

Of course, you want to duplicate as little of the generic scheduling logic as you can. So I think the challenge is how to expose the generic scheduler's functionality as a base class or composition of utilities so that defining your strategy doesn't require too much copy-paste.

-Andy
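To illustrate the base-class route: a minimal sketch, assuming GenericScheduler::tryCandidate() were exposed as a virtual customization point (at the time of this thread it is not), with SystemZPreRASchedStrategy as a hypothetical name. The createMachineScheduler() hook and ScheduleDAGMILive are the real plumbing a target uses to install its own strategy:

    namespace {
    // Hypothetical strategy: inherit all generic behavior, override only
    // the candidate comparison.
    class SystemZPreRASchedStrategy : public GenericScheduler {
    public:
      SystemZPreRASchedStrategy(const MachineSchedContext *C)
          : GenericScheduler(C) {}

    protected:
      void tryCandidate(SchedCandidate &Cand, SchedCandidate &TryCand,
                        SchedBoundary *Zone) /*override, if made virtual*/ {
        // Without such a hook, this body must start as a copy of the
        // generic sequence of checks with the latency boost inserted --
        // exactly the copy-paste problem described above.
      }
    };
    } // end anonymous namespace

    // Installing the strategy through the target's pass config:
    ScheduleDAGInstrs *
    SystemZPassConfig::createMachineScheduler(MachineSchedContext *C) const {
      return new ScheduleDAGMILive(
          C, llvm::make_unique<SystemZPreRASchedStrategy>(C));
    }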
> Of course, you want to duplicate as little of the generic scheduling logic
> as you can. So I think the challenge is how to expose the
> generic scheduler's functionality as a base class or composition of
> utilities so that defining your strategy doesn't require too much
> copy-paste.

Isn't GCNMaxOccupancySchedStrategy [1] already an example of using GenericScheduler as the base class?

[1] http://llvm.org/doxygen/classllvm_1_1GCNMaxOccupancySchedStrategy.html

--
Wei-Ren Chen (陳韋任)
Homepage: https://people.cs.nctu.edu.tw/~chenwj
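For reference, the AMDGPU strategy is installed through the scheduler registry, which also makes it selectable with -misched=gcn-max-occupancy. Roughly paraphrased from the in-tree registration in AMDGPUTargetMachine.cpp (details such as DAG mutations omitted; see [1] for the authoritative declaration):

    static ScheduleDAGInstrs *
    createGCNMaxOccupancyMachineScheduler(MachineSchedContext *C) {
      // GCNMaxOccupancySchedStrategy derives from GenericScheduler.
      return new ScheduleDAGMILive(
          C, llvm::make_unique<GCNMaxOccupancySchedStrategy>(C));
    }

    static MachineSchedRegistry
    GCNMaxOccupancySchedRegistry("gcn-max-occupancy",
                                 "Run GCN scheduler to maximize occupancy",
                                 createGCNMaxOccupancyMachineScheduler);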
Florian Hahn via llvm-dev
2017-Nov-27 15:57 UTC
[llvm-dev] mischeduler (pre-RA) experiments
Hi,

On 23/11/2017 10:53, Jonas Paulsson via llvm-dev wrote:
> [...]
>
> In English: do tryLatency either if the latency of the candidate is >= 7
> and the DAG has no really long-latency SUs (lat > 15), or alternatively
> always if the DAG is really big (>180 SUnits).

Thanks for those experiments! I made similar observations when trying to tune the scheduling heuristics for AArch64/ARM cores. For example, I put this patch, which makes scheduling for latency more aggressive, up for review: https://reviews.llvm.org/D38279. It gave a +0.74% improvement in SPEC2017 score on Cortex-A57. But I never really pushed any further on this so far.

The thing I found is that when deciding to schedule for latency during bottom-up scheduling, we seem to use CurrZone.getCurrCycle() to get the number of issued cycles, which is then added to the remaining latency. Unless I am missing something, the cycle gets bumped by one after scheduling an instruction, regardless of that instruction's latency. It seems like CurrZone.getScheduledLatency() would more accurately represent the latency scheduled so far, but I am probably missing something.

The test case I was looking at on AArch64, where the long-latency SDIV instruction was not scheduled as early as possible, is:

    define hidden i32 @foo(i32 %a, i32 %b, i32 %c, i32* %d) local_unnamed_addr #0 {
    entry:
      %xor = xor i32 %c, %b
      %ld = load i32, i32* %d
      %add = add nsw i32 %xor, %ld
      %div = sdiv i32 %a, %b
      %sub = sub i32 %div, %add
      ret i32 %sub
    }

Cheers,
Florian
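A sketch of the check in question: the first form paraphrases the bottom-up latency policy in GenericSchedulerBase::setPolicy(), and the second substitutes the accessor mentioned above. Both getCurrCycle() and getScheduledLatency() are real SchedBoundary members (the latter returns the max of the zone's expected latency and the current cycle); whether the substitution actually helps is the open question here:

    // Existing policy (paraphrased): prefer latency once the remaining
    // critical path no longer fits in the cycles issued so far.
    if (RemLatency + CurrZone.getCurrCycle() > Rem.CriticalPath)
      Policy.ReduceLatency |= true;

    // getCurrCycle() is bumped once per scheduled instruction regardless
    // of that instruction's latency; getScheduledLatency() reports
    // std::max(ExpectedLatency, CurrCycle) for the zone instead:
    if (RemLatency + CurrZone.getScheduledLatency() > Rem.CriticalPath)
      Policy.ReduceLatency |= true;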