My 2c... I understand this might be way off in the future, but we are talking about long-term plans here anyway. Also, as a VLIW backend maintainer, I just have to say it :)

- We do need a way to assign bundles much earlier than we do now. And it needs to be intertwined with scheduling (the Bundler currently reuses a good chunk of scheduler infrastructure). It is also obvious that this will have an adverse effect on all the downstream passes. It is all the more frustrating because bundling is trivial to do during scheduling, but it goes hard against the original assumptions made elsewhere. Re-bundling is also a distinct task that might need to be addressed in this context.

- We do need at least a distant plan for global scheduling. BB scope is nice and manageable, but I can easily get several percent of "missed" performance by simply implementing a pull-up pass at the very end of code generation... meaning multiple opportunities were lost earlier. The current way to express and maintain the state needed for global scheduling remains to be improved.

- SW pipelining and scheduler interaction with it. When (not if:) we have a robust SW pipeliner, it will likely take place before the first scheduling pass, and we do not want to "undo" decisions made there.

Sergei

--
Qualcomm Innovation Center, Inc.
is a member of Code Aurora Forum.

> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
> On Behalf Of Andrew Trick
> Sent: Friday, May 11, 2012 12:29 AM
> To: Hal Finkel
> Cc: dag at cray.com; wrf at cray.com; LLVM Developers Mailing List
> Subject: Re: [LLVMdev] Scheduler Roadmap
>
> On May 10, 2012, at 9:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> >> - Target pass configuration: DONE
> >> - MachineScheduler pass framework: DONE
> >> - MI Scheduling DAG: DONE
> >> - AliasAnalysis aware DAG option: In review (Sergei)
> >> - Bidirectional list scheduling: DONE
> >> - LiveInterval Update: WIP (simple instruction reordering is supported)
> >> - Target-independent precise modeling of register pressure: DONE
> >> - Register pressure reduction scheduler: WIP
> >> - Support for existing HazardChecker plugin
> >
> > Is support for the existing hazard detectors working now? [it does not
> > say DONE or WIP here, but your comment below implies, I think, that it
> > is at least partially working]
>
> Glad you're interested. I can explain. We have several important tools
> in LLVM that most schedulers will need. That's what I was listing below
> (configurable pass, DAG, LiveInterval update, RegPressure, Itinerary,
> HazardChecker--normally called a reservation table).
>
> I really should have also mentioned the DFAPacketizer developed by the
> Hexagon team. It's being used by their VLIW scheduler, but not by the
> new "standard" scheduler that I'm working on.
>
> Now that I've mentioned that, I should mention MachineInstrBundles,
> which was a necessary IR feature to support the VLIW scheduler, but has
> other random uses--sometimes we want to glue machine instructions.
>
> The HazardChecker was already being used by the PostRA scheduler before
> I started working on infrastructure for a new scheduler. So it's there,
> and can be used by custom schedulers.
>
> My first goal was to complete all of these pieces.
> They're in pretty good shape now but not well tested. The target-independent model for register pressure derived from arbitrary register definitions was by far the most difficult aspect. Now I need to develop a standard scheduling algorithm that will work reasonably well for any target given the register description and, optionally, a scheduling itinerary.
>
> The register pressure reduction heuristic was the first that I threw into the standard scheduler because it's potentially useful by itself. It's WIP.
>
> I haven't plugged in the HazardChecker, but it's quite straightforward.
>
> At that point, I'll have two competing scheduling constraints and will begin implementing a framework for balancing those constraints. I'll also add fuzzy constraints such as expected latency and other CPU resources. When I get to that point, I'll explain more, and I hope you and others will follow along and help with performance analysis and heuristics.
>
> I will point out one important aspect of the design now. If scheduling is very important for your target's performance, and you are highly confident that you model your microarchitecture effectively and have just the right heuristics, then it might make sense to give the scheduler free rein to shuffle the instructions. The standard MachineScheduler will not make that assumption. It certainly can be tuned to be as aggressive as we like, but unless there is a high level of confidence that reordering instructions will be beneficial, we don't want to do it. Rule #1 is not to follow blind heuristics that reschedule reasonable code into something pathologically bad. This notion of confidence is not something schedulers typically have, and it is fundamental to the design.
>
> For example, most schedulers have to deal with the opposing constraints of register pressure and ILP. An aggressive way to deal with this is by running two separate scheduling passes.
> First top-down to find the optimal latency, then bottom-up to minimize the resources needed to achieve that latency. Naturally, after the first pass, you've shuffled instructions beyond all recognition. Instead, we deal with this problem by scheduling in both directions simultaneously. At each point, we know which resources and constraints are likely to impact the CPU pipeline at both the top and bottom of the scheduling region. Doing this doesn't solve any fundamental problem, but it gives the scheduler great freedom at each point, including the freedom to do absolutely nothing, which is probably exactly what you want for a fair amount of X86 code.
>
> >> - New target description feature: buffered resources
> >> - Modeling min latency, expected latency, and resource constraints
> >
> > Can you comment on how min and expected latency will be used in the
> > scheduling?
>
> In the new scheduler's terminology, min latency is an interlocked resource, and expected latency is a buffered resource. Interlocked resources are used to form instruction groups (for performance only, not correctness). For out-of-order targets with register renaming, we can use zero-cycle min latency so there is no interlock within an issue group. Instead, we know the expected latency of the scheduled instructions relative to the critical path. We can balance the schedule so that neither the expected latency of the top nor of the bottom scheduled instructions exceeds the overall critical path. This way, we will slice up two very long independent chains into neat chunks, instead of the random shuffling that we do today.
>
> >> - Heuristics that balance interlocks, regpressure, latency, and
> >> buffered resources
> >>
> >> For targets where scheduling is critical, I encourage developers who
> >> stay in sync with trunk to write their own target-specific scheduler
> >> based on the pieces that are already available. Hexagon developers
> >> are doing this now.
> >> The LLVM toolkit for scheduling is all there--not perfect, but ready
> >> for developers.
> >>
> >> - Pluggable MachineScheduler pass
> >> - Scheduling DAG
> >> - LiveInterval Update
> >> - RegisterPressure tracking
> >> - InstructionItinerary and HazardChecker (to be extended)
> >>
> >> If you would simply like improved X86 scheduling without rolling your
> >> own, then providing feedback and test cases is useful so we can
> >> incorporate improvements into the standard scheduler while it's being
> >> developed.
> >
> > Does this mean that we're going to see a new X86 scheduling paradigm,
> > or is the existing ILP heuristic, in large part, expected to stay?
>
> It's a new paradigm but not a change in focus--we're not modeling the microarchitecture in any greater detail, although other contributors are encouraged to do that.
>
> Both schedulers will be supported for a time. In fact, it will make sense to run both in the same compile until MISched is good enough to take over. It will be easy to determine when one scheduler is doing better than the other. I'm relying on you to tell me when it's doing the wrong thing.
>
> -Andy
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
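Andy's idea of scheduling a region from both ends at once, balancing accumulated latency against the critical path, can be sketched with a toy model. This is purely illustrative Python, not LLVM's actual MachineScheduler; the DAG representation, node names, and the "fill the cheaper end first, longest latency first" policy are simplifying assumptions made for this sketch.

```python
def bidirectional_schedule(nodes, deps, latency):
    """Toy bidirectional list scheduler.

    nodes: list of node ids; deps: set of (pred, succ) edges forming a DAG;
    latency: dict mapping node id -> int cycle count.
    Grows a schedule from both the top and the bottom of the region,
    filling whichever boundary has accumulated less latency so neither
    end runs past the other unchecked.
    """
    preds = {n: {p for (p, s) in deps if s == n} for n in nodes}
    succs = {n: {s for (p, s) in deps if p == n} for n in nodes}
    top, bottom = [], []                 # grown downward / grown upward
    in_top, in_bottom = set(), set()
    top_lat = bot_lat = 0
    while len(in_top) + len(in_bottom) < len(nodes):
        free = [n for n in nodes if n not in in_top and n not in in_bottom]
        top_ready = [n for n in free if preds[n] <= in_top]
        bot_ready = [n for n in free if succs[n] <= in_bottom]
        if top_ready and (top_lat <= bot_lat or not bot_ready):
            n = max(top_ready, key=lambda x: latency[x])  # longest latency first
            top.append(n); in_top.add(n); top_lat += latency[n]
        else:
            n = max(bot_ready, key=lambda x: latency[x])
            bottom.append(n); in_bottom.add(n); bot_lat += latency[n]
    return top + bottom[::-1]            # bottom half was built in reverse
```

Note the freedom this structure gives: at every step the scheduler may take a node from either boundary (or, in a real implementation, leave the source order alone), which is the "confidence" point made above.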
Sergei Larin <slarin at codeaurora.org> writes:

> - We do need to have a way to assign bundles much earlier than we do now.

Yeah, I can imagine why this would be useful.

> And it needs to be intertwined with scheduling (Bundler currently reuses a
> good chunk of scheduler infrastructure).

Just to clarify, is the need due to the current bundling implementation reusing scheduler infrastructure, or is there a more fundamental reason the two should be tied together? I can imagine some advantages of fusing the two, but I'm no VLIW expert.

> It is also obvious that it will have an adverse effect on all the
> downstream passes.

How so? Isn't the bundle representation supposed to be fairly transparent to passes that don't care about it?

> It is all the more frustrating because bundling is trivial to do during
> scheduling, but it goes hard against the original assumptions made
> elsewhere.

Can you explain more about this?

> Re-bundling is also a distinct task that might need to be addressed in
> this context.

Rebundling is certainly useful. I'm not sure what you mean by "in this context."

> - We do need at least a distant plan for global scheduling.

Yes, definitely!

> BB scope is nice and manageable, but I can easily get several percent
> of "missed" performance by simply implementing a pull-up pass at the
> very end of code generation... meaning multiple opportunities were
> lost earlier. The current way to express and maintain the state needed
> for global scheduling remains to be improved.

We could also use a global scheduler in the medium term, though it's not absolutely critical. A nice clean infrastructure to support global scheduling with multiple heuristics, etc. would be very valuable. It would also be a lot of work. :)

> - SW pipelining and scheduler interaction with it. When (not if:) we
> have a robust SW pipeliner, it will likely take place before the first
> scheduling pass, and we do not want to "undo" decisions made there.

Right.
We (the LLVM community) will need ways to mark regions "don't schedule."

-Dave
Dave,

Thank you for your interest. Please see my replies below. Sorry that my terminology is not as crisp as Andy's, but I think you can see what I mean.

Sergei

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.

> -----Original Message-----
> From: dag at cray.com [mailto:dag at cray.com]
> Sent: Friday, May 11, 2012 12:14 PM
> To: Sergei Larin
> Cc: 'Andrew Trick'; 'Hal Finkel'; wrf at cray.com; 'LLVM Developers Mailing List'
> Subject: Re: [LLVMdev] Scheduler Roadmap
>
> Sergei Larin <slarin at codeaurora.org> writes:
>
> > - We do need to have a way to assign bundles much earlier than we do now.
>
> Yeah, I can imagine why this would be useful.
>
> > And it needs to be intertwined with scheduling (Bundler currently
> > reuses a good chunk of scheduler infrastructure).
>
> Just to clarify, is the need due to the current bundling implementation
> reusing scheduler infrastructure, or is there a more fundamental
> reason the two should be tied together? I can imagine some advantages
> of fusing the two, but I'm no VLIW expert.

[Larin, Sergei] A little bit of both. The current bundler uses the DAG dependence builder to facilitate its analysis; for that it actually instantiates a full MI scheduler... without scheduling. Ideally the scheduler itself should be able to produce bundled code (since it has the best picture of machine resources and the instruction stream), but a standalone bundler might still be needed to re-bundle incrementally (which it does not do right now). In short, I see the bundler as a utility, not as a pass.

> > It is also obvious that it will have an adverse effect on all the
> > downstream passes.
>
> How so? Isn't the bundle representation supposed to be fairly
> transparent to passes that don't care about it?

[Larin, Sergei] Kind of. Once bundles are finalized, the bundle header becomes a new "super instruction", and if a pass does not need to look at individual (MI) instructions, there will not be any difference for it.
But if a pass needs to deal with individual MIs, things get interesting. For one, we lack an API for moving/adding/removing individual MIs to/from finalized bundles. We also lack an API to move MIs between BBs in the presence of bundles. Live intervals obviously do not work with bundles... Two, semantics (dependencies) within a bundle are parallel (think { r0 = r1; r1 = r0 } in serial vs. parallel semantics), and if a pass needs to "understand" that, it will need to be "taught" how to. This is where incremental rebundling might come in handy. Fortunately, we currently do bundling fairly late, so it is not an issue yet.

> > It is all the more frustrating because bundling is trivial to do
> > during scheduling, but it goes hard against the original assumptions
> > made elsewhere.
>
> Can you explain more about this?

[Larin, Sergei] The core of the bundler is the DFA state machine, which also _must_ be a part of any VLIW scheduler, so in my Hexagon VLIW "custom" scheduler (I actually have two - SDNode and MI based) I virtually __create__ bundles, but discard them at the end of the pass, only to recreate them again later in the standalone bundler. The second attempt is riddled with additional false dependencies (anti, output, etc.) introduced by register allocation, so bundling quality is affected.

> > Re-bundling is also a distinct task that might need to be addressed
> > in this context.
>
> Rebundling is certainly useful. I'm not sure what you mean by "in this
> context."

[Larin, Sergei] I think the above explanation covers this.

> > - We do need to have at least a distant plan for global scheduling.
>
> Yes, definitely!
>
> > BB scope is nice and manageable, but I can easily get several percent
> > of "missed" performance by simply implementing a pull-up pass at the
> > very end of code generation... meaning multiple opportunities were
> > lost earlier. Current way to express and maintain state needed for
> > global scheduling remains to be improved.
> We could also use a global scheduler in the medium term, though it's not
> absolutely critical. A nice clean infrastructure to support global
> scheduling with multiple heuristics, etc. would be very valuable. It
> would also be a lot of work. :)

[Larin, Sergei] I will need to start doing this very soon to meet my performance goals, but extensive discussion and generic support might take some time to crystallize... It is indeed critically needed, though.

> > - SW pipelining and scheduler interaction with it. When (not if:) we
> > will have a robust SW pipeliner, it will likely take place before the
> > first scheduling pass, and we do not want to "undo" some decision made
> > there.
>
> Right. We (the LLVM community) will need ways to mark regions "don't
> schedule."
>
> -Dave
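The serial-vs-parallel semantics point above can be made concrete with a toy interpreter. Inside a finalized bundle, every source operand reads the register state at bundle entry, so { r0 = r1; r1 = r0 } is a swap; executed serially, the same two moves duplicate r1. The register names and the interpreter are invented for this sketch; this is not LLVM code.

```python
def exec_serial(regs, moves):
    """Execute register-to-register moves one after another: each move
    observes the results of the moves before it."""
    regs = dict(regs)
    for dst, src in moves:
        regs[dst] = regs[src]
    return regs

def exec_bundle(regs, moves):
    """Execute the same moves with parallel (bundle) semantics: every
    source reads the register state as it was at bundle entry."""
    snapshot = dict(regs)           # all reads see pre-bundle values
    regs = dict(regs)
    for dst, src in moves:
        regs[dst] = snapshot[src]
    return regs
```

A pass that moves an MI out of a bundle and replays it serially silently changes which of these two semantics applies, which is why individual-MI edits on finalized bundles need explicit API support.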
On May 11, 2012, at 9:53 AM, Sergei Larin <slarin at codeaurora.org> wrote:

> - We do need to have a way to assign bundles much earlier than we do now.
> And it needs to be intertwined with scheduling (Bundler currently reuses a
> good chunk of scheduler infrastructure). It is also obvious that it will
> have an adverse effect on all the downstream passes. It is all the more
> frustrating because bundling is trivial to do during scheduling, but it
> goes hard against the original assumptions made elsewhere. Re-bundling is
> also a distinct task that might need to be addressed in this context.

The design was intended to support integrated scheduling and bundling before regalloc. The scheduler only needs to set the instructions' "isInsideBundle" flag. Being an early user, you may run into regalloc bugs. If so, we'll try to get them fixed. The trickiest part of this is updating LiveIntervals. We should provide a better API for your bundler to work with, but I don't want to have that discussion until you completely understand what's available today and are ready to test improvements.

> - We do need to have at least a distant plan for global scheduling. BB
> scope is nice and manageable, but I can easily get several percent of
> "missed" performance by simply implementing a pull-up pass at the very end
> of code generation... meaning multiple opportunities were lost earlier.
> Current way to express and maintain state needed for global scheduling
> remains to be improved.

I sympathize. Global scheduling is fun, and superblock scheduling is not hard. Unfortunately, it is not a problem I'm attacking in the foreseeable future (it's not in the "roadmap"). Finding creative ways to compensate is probably worthwhile: an early code motion pass (before coalescing), linearizing the CFG to appear like one block before scheduling, and potentially splitting it again when you're done scheduling... Anything beyond this will require you to write your own DAG builder.
If you're successful at that, and want to contribute back with enough interest from other developers, then we can talk about merging the infrastructure.

> - SW pipelining and scheduler interaction with it. When (not if:) we will
> have a robust SW pipeliner, it will likely take place before the first
> scheduling pass, and we do not want to "undo" some decision made there.

OK. I think it's a mechanical problem.

-Andy
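The resource-checking core that the bundling discussion in this thread keeps returning to (the DFA state machine shared by the standalone bundler and any VLIW scheduler) behaves roughly like the greedy sketch below. The slot names, instruction classes, and 4-wide machine are invented for illustration; LLVM's real DFAPacketizer is generated from the target's instruction itineraries and this sketch also ignores the intra-bundle dependence legality that a real packetizer must check.

```python
# Hypothetical 4-wide VLIW: which slots each instruction class may occupy.
SLOTS = {"slot0", "slot1", "slot2", "slot3"}
CAN_USE = {
    "load":   {"slot0", "slot1"},                      # memory-capable slots only
    "alu":    {"slot0", "slot1", "slot2", "slot3"},    # any slot
    "mul":    {"slot2", "slot3"},                      # multiplier slots
    "branch": {"slot3"},                               # branch unit
}

def bundle(insns):
    """Greedy first-fit bundling of a list of instruction classes:
    keep adding instructions to the current bundle until no legal slot
    is free, then close the bundle and start a new one."""
    bundles, used = [[]], set()
    for op in insns:
        free = CAN_USE[op] - used
        if not free:                 # no legal slot left: close the bundle
            bundles.append([])
            used = set()
            free = CAN_USE[op]
        used.add(min(free))          # take the lowest-numbered free slot
        bundles[-1].append(op)
    return bundles
```

This also shows why bundling after register allocation loses quality, as noted earlier in the thread: the greedy packer can only consume the instruction order it is given, so anti and output dependencies introduced by regalloc force earlier bundle breaks than a scheduler-integrated bundler would need.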