Jay Foad via llvm-dev
2019-Sep-09 11:22 UTC
[llvm-dev] Fwd: MachineScheduler not scheduling for latency
Hi,

I'm trying to understand why MachineScheduler does a poor job on straight-line code in cases like the one in the attached debug dump. This is on AMDGPU, an in-order target, and the problem is that the IMAGE_SAMPLE instructions have very high (80 cycle) latency, but in the resulting schedule they are often placed right next to their uses, like this:

  1784B %140:vgpr_32 = IMAGE_SAMPLE_LZ_V1_V2 %533:vreg_64, %30:sreg_256, %26:sreg_128, 8, 0, 0, 0, 0, 0, 0, 0, 0, implicit $exec :: (dereferenceable load 4 from custom TargetCustom8)
  1792B %142:vgpr_32 = V_MUL_F32_e32 %44:sreg_32, %140:vgpr_32, implicit $exec
  ...
  1784B %140:vgpr_32 = IMAGE_SAMPLE_LZ_V1_V2 %533:vreg_64, %30:sreg_256, %26:sreg_128, 8, 0, 0, 0, 0, 0, 0, 0, 0, implicit $exec :: (dereferenceable load 4 from custom TargetCustom8)
  1792B %142:vgpr_32 = V_MUL_F32_e32 %44:sreg_32, %140:vgpr_32, implicit $exec

This can be improved slightly in post-RA scheduling, but not much. The post-RA scheduler simply doesn't have enough freedom to move instructions around once physical registers have been assigned, so I contend that MachineScheduler needs to consider latency. I've looked at what's going on in the debugger, and the problem seems to be that GenericSchedulerBase::setPolicy does not set Policy.ReduceLatency because it thinks that the other zone (Top in this case) is issue limited. There are several things I don't understand here:

1. "Issue limited" seems to mean that the number of instructions is greater than the length of the critical path. I don't really understand why this is an interesting criterion; it seems to me like a fairly normal state of affairs.

2. Why does the fact that a zone is issue limited mean that it's a good idea to turn off the latency heuristic? Moving instructions around can't really change whether the function is issue limited or not, but it can definitely improve latency problems.

3. Why do we completely turn off the latency heuristic, rather than just letting it take its usual (very low) priority in GenericScheduler::tryCandidate?

4. Stepping back a bit, is MachineScheduler really trying to consider latency at all in pre-RA mode? I see a comment that it "Schedules aggressively for latency in PostRA mode", but what about pre-RA?

Of course we can and do override some of the generic logic in our target, in lib/Target/AMDGPU/GCNSchedStrategy.cpp, but before going further down that route I'd like to understand the intent of the generic logic.

Thanks for any insights,
Jay.

[Attachment: machine-scheduler.txt.gz, application/gzip, 62712 bytes]
Andrew Trick via llvm-dev
2019-Sep-09 18:36 UTC
[llvm-dev] MachineScheduler not scheduling for latency
> On Sep 9, 2019, at 4:22 AM, Jay Foad <jay.foad at gmail.com> wrote:
>
> I'm trying to understand why MachineScheduler does a poor job in
> straight line code in cases like the one in the attached debug dump.
> [...]
> Thanks for any insights,
> Jay.
> <machine-scheduler.txt.gz>

Hi Jay,

General disclaimer: The GenericScheduler is intended to exercise the key features that cover most of what backends will need, and to have heuristics that are good enough for a typical out-of-order CPU. The intent was always that backend developers who care enough about scheduling will plug in a target-specific MachineSchedStrategy. The GenericScheduler also tends to take a "do no harm" approach and leave things in source order until it can determine that scheduling will help.

That said...

Yes, pre-RA is definitely the place to hide latency, but GenericScheduler currently does it very conservatively when MicroOpBufferSize > 1. Backends that need aggressive latency hiding will set MicroOpBufferSize = 1. I would expect you to be doing that for your target.

One reason latency is suppressed/ignored at times with MicroOpBufferSize > 1 is that some loops make good use of registers when left in source order, but exhibit spill code as soon as the scheduler starts shuffling things.

"Issue limited" means that forming instruction groups is on average more important than reducing the critical path. IsResourceLimited is similar, but across all modeled processor resources.

As you said, that doesn't imply that latency is irrelevant. That's probably being controlled by 'shouldReduceLatency', which looks to me like it suppresses latency hiding until the critical path becomes relevant. My guess is that it's extremely conservative about latency hiding because defaulting to latency hiding can get you backed into a corner with register pressure.

Again, shouldReduceLatency should not be relevant at MicroOpBufferSize = 1, because the scheduler should prioritize issue stalls over other heuristics.

Looking at the key decision point in your trace, I see the scheduler is being driven by the `REG-CRIT` heuristic. So, it thinks that failing to schedule the use and def together would increase pressure in a critical register class.

There's no expectation that the GenericScheduler heuristics are anywhere near as good as they can be over the universe of workloads, but changing those heuristics means checking in with the current backend maintainers to be sure you're not regressing their benchmarks in the process. I haven't touched these heuristics in 6 years, but several others have had a hand in them recently.

-Andy
Jay Foad via llvm-dev
2019-Sep-10 14:32 UTC
[llvm-dev] MachineScheduler not scheduling for latency
Hi Andy,

Thanks for the explanations. Yes, AMDGPU is in-order and has MicroOpBufferSize = 1.

Re "issue limited" and instruction groups: could it make sense to disable the generic scheduler's detection of issue limitation on in-order CPUs, or on CPUs that don't define instruction groups, or some similar condition? Something like:

--- a/lib/CodeGen/MachineScheduler.cpp
+++ b/lib/CodeGen/MachineScheduler.cpp
@@ -2062,10 +2062,13 @@ getOtherResourceCount(unsigned &OtherCritIdx) {
   if (!SchedModel->hasInstrSchedModel())
     return 0;

-  unsigned OtherCritCount = Rem->RemIssueCount
-    + (RetiredMOps * SchedModel->getMicroOpFactor());
-  LLVM_DEBUG(dbgs() << "  " << Available.getName() << " + Remain MOps: "
-             << OtherCritCount / SchedModel->getMicroOpFactor() << '\n');
+  unsigned OtherCritCount = 0;
+  if (some condition) {
+    OtherCritCount = Rem->RemIssueCount
+      + (RetiredMOps * SchedModel->getMicroOpFactor());
+    LLVM_DEBUG(dbgs() << "  " << Available.getName() << " + Remain MOps: "
+               << OtherCritCount / SchedModel->getMicroOpFactor() << '\n');
+  }
   for (unsigned PIdx = 1, PEnd = SchedModel->getNumProcResourceKinds();
        PIdx != PEnd; ++PIdx) {
     unsigned OtherCount = getResourceCount(PIdx) + Rem->RemainingCounts[PIdx];

As for "shouldReduceLatency should not be relevant at MicroOpBufferSize = 1": are you suggesting that shouldReduceLatency should effectively be changed to always return true on in-order CPUs? Even with that change, latency comes pretty far down the list of criteria tested in GenericScheduler::tryCandidate.

Thanks,
Jay.

On Mon, 9 Sep 2019 at 19:36, Andrew Trick <atrick at apple.com> wrote:
> [...]