Hal Finkel
2011-Dec-19 23:19 UTC
[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)
On Mon, 2011-12-19 at 07:41 -0800, Andrew Trick wrote:
> On Dec 19, 2011, at 6:51 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> > On Tue, 2011-10-25 at 21:00 -0700, Andrew Trick wrote:
> > > Now, to generate the best PPC schedules, there is one thing you may
> > > want to override. The scheduler's priority function has a
> > > HasReadyFilter attribute (enum). It can be overridden by specializing
> > > hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
> > > scheduling, and maximizes the instructions that can issue in one
> > > group, regardless of register pressure. We still care about register
> > > pressure enough in ARM to avoid enabling this. I'm really not sure how
> > > much it will help on modern PPC implementations though.
> > > hybrid_ls_rr_sort
> >
> > Can this be done without modifying common code? It looks like
> > hybrid_ls_rr_sort is local to ScheduleDAGRRList.cpp.
> >
> > Thanks again,
> > Hal
>
> Right. You would need to specialize the priority queue logic. A small
> amount of common code.
> Andy

Andy,

I played around with this some today for my PPC 440 chips. These are
embedded chips (multiple pipelines but in-order), and may be more
similar to your ARMs than to the PPC-970 style designs...

I was able to get reasonable PPC 440 code generation by using the ILP
scheduler pre-RA and then the post-RA scheduler with ANTIDEP_ALL (and my
load/store reordering patch). This worked significantly better than
using either hybrid or ilp alone (with or without setting
HasReadyFilter). I was looking at my primary use case, which is
partially-unrolled loops with loads, stores and floating-point
calculations.

This seems to work b/c ILP first groups the instructions to extract
parallelism and then the post-RA scheduler breaks up the groups to avoid
stalls. This allows the scheduler to find its way out of what seems to
be a "local minimum" of sorts, whereby it wants to schedule each
unrolled iteration of the loop sequentially. The reason why this seems
to occur is that the hybrid scheduler would prefer to suffer a large
data-dependency delay over a shorter full-pipeline delay. Do you know
why it would do this? (You can see PR11589 for an example if you'd
like.)

Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
used? Is there a reason that this is a compile-time constant? Both
Hybrid and ILP have isReady() functions. I can certainly propose a patch
to make them command-line options.

Thanks again,
Hal

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
Andrew Trick
2011-Dec-20 06:14 UTC
[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)
On Dec 19, 2011, at 3:19 PM, Hal Finkel wrote:
> On Mon, 2011-12-19 at 07:41 -0800, Andrew Trick wrote:
>> On Dec 19, 2011, at 6:51 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>> On Tue, 2011-10-25 at 21:00 -0700, Andrew Trick wrote:
>>>> Now, to generate the best PPC schedules, there is one thing you may
>>>> want to override. The scheduler's priority function has a
>>>> HasReadyFilter attribute (enum). It can be overridden by specializing
>>>> hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
>>>> scheduling, and maximizes the instructions that can issue in one
>>>> group, regardless of register pressure. We still care about register
>>>> pressure enough in ARM to avoid enabling this. I'm really not sure how
>>>> much it will help on modern PPC implementations though.
>>>> hybrid_ls_rr_sort
>>>
>>> Can this be done without modifying common code? It looks like
>>> hybrid_ls_rr_sort is local to ScheduleDAGRRList.cpp.
>>>
>>> Thanks again,
>>> Hal
>>
>> Right. You would need to specialize the priority queue logic. A small
>> amount of common code.
>> Andy
>
> Andy,
>
> I played around with this some today for my PPC 440 chips. These are
> embedded chips (multiple pipelines but in-order), and may be more
> similar to your ARMs than to the PPC-970 style designs...
>
> I was able to get reasonable PPC 440 code generation by using the ILP
> scheduler pre-RA and then the post-RA scheduler with ANTIDEP_ALL (and my
> load/store reordering patch). This worked significantly better than
> using either hybrid or ilp alone (with or without setting
> HasReadyFilter). I was looking at my primary use case which is
> partially-unrolled loops with loads, stores and floating-point
> calculations.
>
> This seems to work b/c ILP first groups the instructions to extract
> parallelism and then the post-RA scheduler breaks up the groups to avoid
> stalls. This allows the scheduler to find its way out of what seems to
> be a "local minimum" of sorts, whereby it wants to schedule each
> unrolled iteration of the loop sequentially. The reason why this seems
> to occur is that the hybrid scheduler would prefer to suffer a large
> data-dependency delay over a shorter full-pipeline delay. Do you know
> why it would do this? (you can see PR11589 for an example if you'd
> like).

The "ilp" scheduler has several heuristics designed to compensate for
the lack of an itinerary. Each of those heuristics has a flag, so you
can see what works for your target. I've never used that scheduler with
an itinerary, but it should work. It's just that some of the heuristics
effectively override the hazard checker.

The "hybrid" scheduler depends more on the itinerary/hazard checker.
It's less likely to schedule instructions close together if they may
induce a pipeline stall, regardless of operand latency.

> Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
> used? Is there a reason that this is a compile-time constant? Both
> Hybrid and ILP have isReady() functions. I can certainly propose a patch
> to make them command-line options.

It's a compile-time constant because it's clearly on the scheduler's
critical path and not used by any active targets. Enabling
HasReadyFilter turns the preRA scheduler into a strict scheduler such
that the hazard checker overrides all other heuristics. That's not what
you want if you're also enabling postRA scheduling!

-Andy
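
[Editorial sketch, not part of the original message.] The point about HasReadyFilter being a compile-time constant can be illustrated with a small, self-contained C++ example. Everything here is hypothetical: Node, RelaxedPolicy, StrictPolicy and pickGroup are made-up names for illustration and are not the actual code in ScheduleDAGRRList.cpp. The sketch only assumes that a policy type carries a constant flag and an isReady() check, and that a "strict" policy drops nodes with hazards before any other priority is considered.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a schedulable unit; not LLVM's SUnit.
struct Node {
  unsigned Id;
  bool HasHazard; // true if issuing now would trip the hazard checker
};

// Hypothetical policy with the ready filter compiled out.
struct RelaxedPolicy {
  static constexpr bool HasReadyFilter = false;
  static bool isReady(const Node &) { return true; }
};

// Hypothetical policy with the ready filter compiled in.
struct StrictPolicy {
  static constexpr bool HasReadyFilter = true;
  static bool isReady(const Node &N) { return !N.HasHazard; }
};

// Collect up to IssueWidth node ids from the queue, in queue order.
template <class Policy>
std::vector<unsigned> pickGroup(const std::vector<Node> &Queue,
                                unsigned IssueWidth) {
  assert(IssueWidth > 0 && "an issue group needs at least one slot");
  std::vector<unsigned> Group;
  for (const Node &N : Queue) {
    if (Group.size() == IssueWidth)
      break;
    // HasReadyFilter is a constant, so for RelaxedPolicy the compiler can
    // drop this test from the hot loop entirely.
    if (Policy::HasReadyFilter && !Policy::isReady(N))
      continue;
    Group.push_back(N.Id);
  }
  return Group;
}
```

With a queue of three nodes where only the second has a hazard and an issue width of 2, pickGroup<RelaxedPolicy> takes the first two nodes, while pickGroup<StrictPolicy> skips the hazardous one. Turning the flag into a command-line option would replace the constant with a runtime load on this path, which is the cost being weighed in the reply above.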
Hal Finkel
2011-Dec-20 06:53 UTC
[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)
On Mon, 2011-12-19 at 22:14 -0800, Andrew Trick wrote:
> On Dec 19, 2011, at 3:19 PM, Hal Finkel wrote:
>
> > On Mon, 2011-12-19 at 07:41 -0800, Andrew Trick wrote:
> >> On Dec 19, 2011, at 6:51 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> >>
> >>> On Tue, 2011-10-25 at 21:00 -0700, Andrew Trick wrote:
> >>>> Now, to generate the best PPC schedules, there is one thing you may
> >>>> want to override. The scheduler's priority function has a
> >>>> HasReadyFilter attribute (enum). It can be overridden by specializing
> >>>> hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
> >>>> scheduling, and maximizes the instructions that can issue in one
> >>>> group, regardless of register pressure. We still care about register
> >>>> pressure enough in ARM to avoid enabling this. I'm really not sure how
> >>>> much it will help on modern PPC implementations though.
> >>>> hybrid_ls_rr_sort
> >>>
> >>> Can this be done without modifying common code? It looks like
> >>> hybrid_ls_rr_sort is local to ScheduleDAGRRList.cpp.
> >>>
> >>> Thanks again,
> >>> Hal
> >>
> >> Right. You would need to specialize the priority queue logic. A small
> >> amount of common code.
> >> Andy
> >
> > Andy,
> >
> > I played around with this some today for my PPC 440 chips. These are
> > embedded chips (multiple pipelines but in-order), and may be more
> > similar to your ARMs than to the PPC-970 style designs...
> >
> > I was able to get reasonable PPC 440 code generation by using the ILP
> > scheduler pre-RA and then the post-RA scheduler with ANTIDEP_ALL (and my
> > load/store reordering patch). This worked significantly better than
> > using either hybrid or ilp alone (with or without setting
> > HasReadyFilter). I was looking at my primary use case which is
> > partially-unrolled loops with loads, stores and floating-point
> > calculations.
> >
> > This seems to work b/c ILP first groups the instructions to extract
> > parallelism and then the post-RA scheduler breaks up the groups to avoid
> > stalls. This allows the scheduler to find its way out of what seems to
> > be a "local minimum" of sorts, whereby it wants to schedule each
> > unrolled iteration of the loop sequentially. The reason why this seems
> > to occur is that the hybrid scheduler would prefer to suffer a large
> > data-dependency delay over a shorter full-pipeline delay. Do you know
> > why it would do this? (you can see PR11589 for an example if you'd
> > like).
>
> The "ilp" scheduler has several heuristics designed to compensate for
> lack of itinerary. Each of those heuristics has a flag, so you can see
> what works for your target. I've never used that scheduler with an
> itinerary, but it should work. It's just that some of the heuristics
> effectively override the hazard checker.
>
> The "hybrid" scheduler depends more on the itinerary/hazard checker.
> It's less likely to schedule instructions close together if they may
> induce a pipeline stall, regardless of operand latency.
>

I'd prefer to have a scheduler that just does what I want :) -- How can
I make a modified version of the hybrid scheduler that will weight
operand latency and pipeline stalls more equally?

Here's my "thought experiment" (from PR11589): I have a bunch of
load-fadd-store chains to schedule. A store takes two cycles to clear
its last pipeline stage. The fadd takes longer to compute its result
(say 5 cycles), but can sustain a rate of 1 independent add per cycle.
As the scheduling is bottom-up, it will schedule a store, then it has a
choice: it can schedule another store (at a 1 cycle penalty), or it can
schedule the fadd associated with the store it just scheduled (with a 4
cycle penalty due to operand latency). It seems that the current hybrid
scheduler will choose the fadd; I want a scheduler that will make the
opposite choice.

> > Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
> > used? Is there a reason that this is a compile-time constant? Both
> > Hybrid and ILP have isReady() functions. I can certainly propose a patch
> > to make them command-line options.
>
> It's a compile time constant because it's clearly on the scheduler's
> critical path and not used by any active targets. Enabling
> HasReadyFilter turns the preRA scheduler into a strict scheduler such
> that the hazard checker overrides all other heuristics. That's not what
> you want if you're also enabling postRA scheduling!

Indeed, that makes sense.

Thanks again,
Hal

>
> -Andy

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
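
[Editorial sketch, not part of the original message.] The arithmetic behind the thought experiment above can be written out as a toy cost model in C++. This is only an illustration under stated assumptions: Candidate, stallCycles, pickLeastStall and thoughtExperiment are hypothetical names, the 5-cycle fadd latency and 2-cycle store occupancy are the figures from the message, and the "pick the smallest total stall" rule is one possible way to weight operand latency and pipeline stalls equally. It is not how the hybrid scheduler actually ranks nodes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one choice available to the scheduler.
struct Candidate {
  std::string Name;
  unsigned OperandStall;  // cycles lost waiting on a producer's result
  unsigned PipelineStall; // cycles lost to a busy pipeline stage
};

// Cycles lost when two dependent (or conflicting) instructions are placed
// Distance cycles apart but need Latency cycles between them.
unsigned stallCycles(unsigned Latency, unsigned Distance) {
  return Latency > Distance ? Latency - Distance : 0;
}

// Choose the candidate with the smallest total stall, counting a cycle of
// operand stall and a cycle of pipeline stall as equally bad.
const Candidate &pickLeastStall(const std::vector<Candidate> &Ready) {
  assert(!Ready.empty() && "need at least one candidate");
  const Candidate *Best = &Ready.front();
  for (const Candidate &C : Ready)
    if (C.OperandStall + C.PipelineStall <
        Best->OperandStall + Best->PipelineStall)
      Best = &C;
  return *Best;
}

// The two choices from the message, right after one store was scheduled.
std::vector<Candidate> thoughtExperiment() {
  const unsigned FaddLatency = 5;    // cycles until the fadd result is ready
  const unsigned StoreOccupancy = 2; // cycles a store holds its last stage
  return {
      // The fadd feeding the scheduled store, placed in the adjacent slot.
      {"fadd", stallCycles(FaddLatency, 1), 0},
      // An independent store, placed in the adjacent slot.
      {"store", 0, stallCycles(StoreOccupancy, 1)},
  };
}
```

Under this model the fadd costs 5 - 1 = 4 stall cycles and the second store costs 2 - 1 = 1, so pickLeastStall(thoughtExperiment()) returns the store, which is the choice the message asks for.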