thr3ads.net - llvm dev - [LLVMdev] [RFC] Scheduler Rework [Apr 2012]

If this information is useful, please help other people find it:
Share via:

dag at cray.com

2012-Apr-20 17:31 UTC

[LLVMdev] [RFC] Scheduler Rework

Hey Everyone,

I'd like to begin a project to rework the scheduler to address some
problems we've discovered on this end.  The goal is to get a more
configurable/flexible scheduler while simplifying maintenance by
separating policy from implementation to get independent and
interchangeable parts.

This is going to be challenging because we are still stuck on LLVM 2.9.
We will be upgrading to LLVM 3.1 when it is released but I really need
to start this work now to meet work deadlines.  But I want to make sure
I have something I can contribute upstream so I want to present the idea
and get some input on whether it is a good design and how best to move
it upstream.  It looks like the fundamentals of the scheduler haven't
changed much from 2.9 to 3.1 so that makes it a bit easier.

Most specifically, I would like some comment on whether it is reasonable
to just start a new RegReduction-like scheduler from scratch rather than
try to incrementally change the existing one.  I'm looking at a pretty
invasive reengineering of the scheduler structure so that it will have
familiar elements but be put together quite differently.  That's why I'm
not sure incremental change to the existing code makes sense.  The two
implementations could live side-by-side for a while if that makes sense.

The existing RegReduction scheduler design goes something like this:

            SchedulingPriorityQueue
                      ^
                      |
              RegReductionPQBase
                      ^
                      |
          RegReductionPriorityQueue<SF>
                                    ^
                                   /|\
                            ______/ | \______
                           |        |        |
                      Heuristic 1  ...  Heuristic N

The only point of customization right now is the template parameter to
RegReductionPriorityQueue.  This is the sort criteria for prioritizing
which node gets scheduled next.  SchedulingPriorityQueue has some
virtual functions which can be (and are) overridden by
RegReductionPQBase and/or RegReductionPriorityQueue.  However, the
current inheritance structure makes it impossible to tweak much even by
subclassing RegReductionPriorityQueue, either because the customization
points aren't virtual (easy to fix) or because customization points are
coupled (hard to fix without a redesign).

Here are some things we have found useful to tune.  Most are not
currently orthogonal and thus are not easily separated from each other:

- Node priority function (the SF parameter above).  This is
  well-separated and easily changed.

- Queue implementation.  The current implementation uses a std::vector<>
  and has very bad asymptotic performance (N^2 currently).  Previously
  this was a std::priority_queue<> but was changed because it missed
  some of the dynamic variance of node priorities.  We have found it
  useful and essential to be able to replace this but even after
  inheriting from RegReductionPriorityQueue it's pretty ugly because
  there are two spearate queues in the scheduler, one completely unused.
  Different queue types exhibit different heuristic behaviors based on
  how much dynamism they provide.  Thus the queue implementation is
  policy that really ought to be separated from the scheduler proper.

- Register pressure heuristic.  The ILP scheduler uses a fairly
  simplistic register pressure calculation which is pretty good in most
  cases but we have found it to not work well in a few important cases.
  It would be nice to make this a customization point.  It is a separate
  heuristic from the node priority function.  In fact some node priority
  functions look at the register pressure heuristic to calculate their
  result.  The two are orthogonal and we should be able to vary them
  independently.  Currently the register pressure heuristic is coupled
  tightly to RegReductionPQBase.  Again due to the data structures begin
  in RegReductionPQBase it is not convenient to simply make some of the
  heuristic methods virtual and override them.  Such an arrangement
  doesn't allow easy mix-and-match among register pressure and node
  priority heuristics.  Moreover, more elaborate register pressure
  calculations will almost certainly involve very different data
  structures.

Here's a quick sketch of a new design floating in my head.  I don't have
any code yet so this is subject to change but it gets the general idea
across.  The idea is to configure the scheduler via policy classes.


            SchedulingPriorityQueue
                      ^
                      |
          RegReductionPriorityQueue<Queue,   Pressure,         Sort>
                                     ^           ^              ^
                                    /|\          |              |
                               ____/ | \____     |              |
                              |      |      |    |              |
                           Queue 1  ...  Queue N |              |
                                                 |              |
                                                 |              |
                                                /|\             |
                                         ______/ | \_____       |
                                        |        |       |      |
                                    Pressure 1  ...  Pressure N |
                                                               /|\
                                                           ___/ | \___
                                                          |     |     |
                                                       Sort 1  ...  Sort N


I'll have to work out exactly how to allow the policy classes to
communicate, for example to allow the Sort heuristic to contact the
Pressure heuristic for information.  There are a few ways to do that.
One is to reintroduce RegReductionPQBase with some pure virtual
functions and use multiple inheritance from RegReductionPriorityQueue to
let the policy classes override them.  A better way, I think, would be
to use composition and pass pointers to each policy object to the
others.

The above design also should allow simple expansion in the future.  If
we identify more customization points we can introduce new policy
classes pretty easily.

Does the above look like a good design?  If so, I believe it is easiest
to build this from scratch (borrowing a lot of code from the current
scheduler) rather than try to incrementally morph the curent scheduler
to this design but I am open to guidance.

Thanks for your help!

                             -Dave

Andrew Trick

2012-Apr-23 18:09 UTC

head link

[LLVMdev] [RFC] Scheduler Rework

On Apr 20, 2012, at 10:31 AM, dag at cray.com wrote:
> I'd like to begin a project to rework the scheduler to address some
> problems we've discovered on this end.  The goal is to get a more
> configurable/flexible scheduler while simplifying maintenance by
> separating policy from implementation to get independent and
> interchangeable parts.
> 
> This is going to be challenging because we are still stuck on LLVM 2.9.
> We will be upgrading to LLVM 3.1 when it is released but I really need
> to start this work now to meet work deadlines.  But I want to make sure
> I have something I can contribute upstream so I want to present the idea
> and get some input on whether it is a good design and how best to move
> it upstream.  It looks like the fundamentals of the scheduler haven't
> changed much from 2.9 to 3.1 so that makes it a bit easier.
We plan to move to the MachineScheduler by 3.2. The structure is:

ScheduleDAG: Abstract DAG of SUnits and SDeps
  |
  v
ScheduleDAGInstrs: Build the DAG from MachineInstrs, each SUnit tied to an MI
                   Delimit the current "region" of code being
scheduled.
  |
  v
ScheduleDAGMI: Concrete implementation that supports both top-down and bottom-up
scheduling
               with live interval update. It divides the region into three
zones:
               Top-scheduled, bottom-scheduled, and unscheduled.

The ScheduleDAGMI constructor takes an instance of MachineSchedStrategy. This is
currently a very simply interface that provides pickNode(), releaseTopNode(),
releaseBottomNode().

The MachineScheduler is plugable at every level.

1. The pass itself is optional. Targets may disable or override it
   completely. For example, a target that implements global scheduling
   would need to override the standard pass.

2. Targets may create a MachineSchedRegistry instance that knows how
   to construct a custom ScheduleDAGInstrs. This is an easy way to
   reuse the standard scheduling pass and DAG builder. The driver will
   determine the scheduling region, but everything else is
   customizable.

3. New scheduling algorithms can be introduced by implementing new
   MachineSchedStrategies.  We can add a target hook to select the
   strategy if needed.

There's still plenty of opportunity to restructure the code and add
customization points. Keep in mind that the goal for the target independent code
is clarity and maintainability. Code reuse across custom schedulers is secondary
and I'm not currently thinking of designing a template library for custom
schedulers.

As much as possible, the standard scheduling algorithm will be built from
standalone utilities and data structures. The customizations that you describe
would all be handled by providing a new MachineSchedStrategy. Start by composing
your scheduler from the pieces that are available, e.g. HazardChecker,
RegisterPressure... (There's not much value providing a scheduling queue
abstraction on top of vector or priority_queue). From that point, we can improve
the design.

Here's what you can expect in the MachineScheduler:

1. Precise register pressure tracking with back off. I'm in the process of
checking in the pieces for this. It should be usable sometime in the next couple
of weeks.

2. Division of the target-defined resources into "interlocked" vs.
"buffered". The HazardChecker will continue to handle the interlocks
for the resources that the hardware handles in order. Buffered resources, which
the hardware can handle out of order. These will be considered separately for
scheduling priority. We will also make a distinction between minimum and
expected operation latency.

3. A register pressure reducing scheduler that considers interlocks and register
pressure. This will prioritize register reuse above the pressure limits and
include a very simple heuristic for avoiding high register pressure situations.

4. A balanced scheduler that still prioritizes interlocks and register pressure
limits, but balances buffered resource usage with register pressure avoidance.

5. Improved heuristics. For example, we can precompute register lineages and use
that information to avoid high register pressure situations.

The overarching goal of the standard scheduling algorithm is to handle the most
important cases that the SDScheduler and PostRAScheduler handle today without
unnecessarily shuffling instructions. This should result in more debugable
generated code with more stable performance. The goal of the scheduler framework
is to support the standard scheduler while providing a place for targets to
introduce a custom scheduler, or simply any optimization that requires the
scheduling DAG.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120423/e008fc18/attachment.html>

dag at cray.com

2012-Apr-24 15:59 UTC

head link

[LLVMdev] [RFC] Scheduler Rework

Andrew Trick <atrick at apple.com> writes:
> We plan to move to the MachineScheduler by 3.2. The structure is:
How hard will this be to backport to 3.1?  Has woprk on this started
yet?
> ScheduleDAG: Abstract DAG of SUnits and SDeps
>   |
>   v
> ScheduleDAGInstrs: Build the DAG from MachineInstrs, each SUnit tied to an
MI
>                    Delimit the current "region" of code being
scheduled.
>   |
>   v
> ScheduleDAGMI: Concrete implementation that supports both top-down and
bottom-up scheduling
>                with live interval update. It divides the region into three
zones:
>                Top-scheduled, bottom-scheduled, and unscheduled.
So does this happen after SDNodes are converted to MachineInstrs?  It's
not clear to me given your description of ScheduleDAGInstrs.  I assume
it uses the current SDNode->SUnit mechanism but I just want to make
sure.
> 3. New scheduling algorithms can be introduced by implementing new
>    MachineSchedStrategies.  We can add a target hook to select the
>    strategy if needed.
This sounds like what we want and yes, it should be target-selectable.
> There's still plenty of opportunity to restructure the code and add
> customization points. Keep in mind that the goal for the target
> independent code is clarity and maintainability. Code reuse across
> custom schedulers is secondary and I'm not currently thinking of
> designing a template library for custom schedulers.
The two are not contradictory but for me I care less about the specific
structure than I do about the flexibility to mix and match heuristics.
It just struck me that templates and policy classes is the natural way
to do it.

I'm glad to hear the top-down scheduler will get some attention.  We'll
be wanting to use that.
> As much as possible, the standard scheduling algorithm will be built
> from standalone utilities and data structures. The customizations that
> you describe would all be handled by providing a new
> MachineSchedStrategy. 
Makes sense to me.
> Start by composing your scheduler from the pieces that are available,
> e.g.  HazardChecker, RegisterPressure... (There's not much value
> providing a scheduling queue abstraction on top of vector or
> priority_queue).
What do you mean by this last point?  We absolutely want to be able to
swap out different queue implementations.  There is a significant
compile time tradeoff to be made here.
> 1. Precise register pressure tracking with back off. I'm in the
> process of checking in the pieces for this. It should be usable
> sometime in the next couple of weeks.
Great!
> 2. Division of the target-defined resources into "interlocked"
> vs. "buffered". The HazardChecker will continue to handle the
> interlocks for the resources that the hardware handles in
> order. 
So by "interlocks" you mean hardware-implemented interlocks?  So that
the scheduler will attempt to avoid them.  Not that we have a machine
like this, but what about machines with no interlocks where software is
responsible for correctness (VLIW, early MIPS, etc.)?  I suppose they
can be handled with a similar mechanism but the latter is much more
strict than the former.
> Buffered resources, which the hardware can handle out of order. These
> will be considered separately for scheduling priority. We will also
> make a distinction between minimum and expected operation latency.
Does this mean you want to model the various queues, hardware scheduling
windows, etc.?  I'm just trying to get a clearer idea of what this
means.
> 3. A register pressure reducing scheduler that considers interlocks
> and register pressure. This will prioritize register reuse above the
> pressure limits and include a very simple heuristic for avoiding high
> register pressure situations.
>
> 4. A balanced scheduler that still prioritizes interlocks and register
> pressure limits, but balances buffered resource usage with register
> pressure avoidance.
>
> 5. Improved heuristics. For example, we can precompute register
> lineages and use that information to avoid high register pressure
> situations.
This all sounds great and is exactly what we're looking for.

As I said, this is a time-critical thing for us.  Is there any way I can
help to move this along?

Thanks!

                             -Dave

Reasonably Related Threads

Search for more maybe matching threads

llvm dev - Apr 2012 - [LLVMdev] [RFC] Scheduler Rework

[LLVMdev] [RFC] Scheduler Rework

[LLVMdev] [RFC] Scheduler Rework

[LLVMdev] [RFC] Scheduler Rework

Reasonably Related Threads