Alexey Zhikhartsev via llvm-dev
2018-Nov-13 15:37 UTC
[llvm-dev] top-down vs. bottom-up list scheduling
> > Hi Matthias,> In my experience bottom-up works better when optimizing for registerpressure while top-down tends to work better when optimizing for latencies/stall reduction. That's a very interesting insight, do you mind sharing the intuition behind it? Thanks, Alex From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of> Matthias Braun via llvm-dev > Sent: November-06-18 3:55 PM > To: Sjoerd Meijer <Sjoerd.Meijer at arm.com> > Cc: llvm-dev at lists.llvm.org > Subject: Re: [llvm-dev] top-down vs. bottom-up list scheduling > > Some comments: > - It's all heuristics, so you may be lucky/unlycky in any configuration. I > assume you know how to read -debug-only=machine-scheduler output and the > GenericScheduler::tryCandidate() code to figure out how most decisions are > made. > - In my experience bottom-up works better when optimizing for register > pressure while top-down tends to work better when optimizing for > latencies/stall reduction. > - The scheduler has a couple rules dealing with pressure. Hitting any of > these limits will make it pick pressure reducing nodes more aggressively > (look for tryPressure() in the tryCandidate() code) > - Hitting the target limits (number of available registers) > - Hitting the maximum of the region before scheduling > - Generally attempt to minimmize pressure (this has lower priority > though) > > In the past I looked at some crypto code that also was carefully tuned by > a human and that turned out to be very tricky for the heuristics to get > right because most values had multiple users. > At the time I had a patch that would revert all scheduling decisions if it > turned out that the scheduler couldn't find a solution with better or equal > register pressure. It wasn't received too enthusiastically and I ended up > tweaking some heuristics instead. But we may start that discussion again if > you cannot find a know to tune the heuristic for your case. > > - Matthias > > > > > Hello List! > > > > I am looking at top-down vs. bottom-up list scheduling for simple(r) > > in-order cores. First, for some context, below is a fairly > > representative pseudo-code example of the sort of DSP-like codes I am > looking at: > > > > uint64_t foo(int *pA, int *pB, unsigned N, unsigned C) { > > uint64_t sum = 0; > > while (N-- > 0) { > > A1 = *pA++; > > A2 = *pA++; > > B1 = *pB++; > > B2 = *pB++; > > ... > > sum += ((uint64_t) A1 * B1) >> C; > > sum += ((uint64_t) A2 * B2) >> C; > > ... > > } > > return sum; > > } > > > > These kernels are very tight loops. In this sort of legacy codes it's > > not uncommon that loops are manually tuned and unrolled with the > > available registers in mind, so that all values just about fit in > > registers. In the example above, we read from input streams A and B, > > and do 2 multiply-accumulate operations. The loop is unrolled 2 times > > in this example, but 4, 8 or 16 would more realistic, resulting in very > high register pressure. > > > > But legacy apps written in this way are not the only culprit; we see > > that with > > (aggressive) loop unrolling we can end up in exactly the same > > situation: the loopunroller does exactly what it needs to do, but it > > results in very high register pressure. > > > > Here's the (obvious) problem: all live values in these loops should > > just about fit in registers, but suboptimal/wrong decisions are made > > resulting in a lot of spills/reloads; the machine scheduler is making > > the life of the register allocator very difficult. I am looking at > > regressions of about 30 - 40% for more than a handful of kernels, and > > am thus very interested in what I could do about this. > > > > The first observation is that it looks like we default to bottom-up > scheduling. > > Starting bottom-up, all these "sum" variables are scheduled first, and > > after that the loads (this is simplifying things a bit). And thus it > > looks like we create a lot of very long live-ranges, causing the > problems mentioned above. > > When instead we start top-down, the loads are picked up first, then > > the MAC, and this repeated for all MACs. The result is a sequence LD, > > MAC, LD, MAC, etc., which is let's say a sequence more in > > program-order, also with shorter live-ranges. This does exactly what I > > want for these sort of kernels, it generates exactly the code we want. > > > > While playing with this bottom-up/top-down order, it didn't take long > > to see that the top-down approach is excellent for these kind of > > codes, but not for some other benchmarks, and so solving this issue > > not just a matter of defaulting to a new scheduling policy. > > > > I am thinking about prototyping an approach where we start with the > > bottom-up approach (under a new misched option), but when a > > register-pressure/live-range trashhold value is reached, we bail and > fall back to the top-down approach. > > This is my first rough idea, but I haven't looked at this problem for > > very long. My first few data points is suggesting this might be > > benificial, but I could be missing a lot here. And also I am sure that > > I am looking at an old/classic problem here and I'm sure others have > > looked at this problem before. Thus I am wondering if people have > experiences/opinions on this. > > > > Cheers, > > Sjoerd. > > IMPORTANT NOTICE: The contents of this email and any attachments are > confidential and may also be privileged. If you are not the intended > recipient, please notify the sender immediately and do not disclose the > contents to any other person, use it for any purpose, or store or copy the > information in any medium. Thank you. > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20181113/cb17a094/attachment.html>
Andrew Trick via llvm-dev
2018-Nov-14 16:42 UTC
[llvm-dev] top-down vs. bottom-up list scheduling
> On Nov 13, 2018, at 7:37 AM, Alexey Zhikhartsev via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Hi Matthias, > > > In my experience bottom-up works better when optimizing for register pressure while top-down tends to work better when optimizing for latencies/stall reduction. > > That's a very interesting insight, do you mind sharing the intuition behind it? > > Thanks, > AlexIn the broadest generalization of the scheduling problem, if your DAG is shaped like a tree, then the number of available nodes at the beginning and end of scheduling will be affected by the order that you schedule. If you schedule the “pointy” end of the tree first (aka the root), the there aren’t many choices at first, and the leaves gradually become available at the end of scheduling. If you schedule the leaves first, then the heuristic must choose from many available nodes at once. When scheduling for resource utilization, it’s hard to correctly choose among many nodes that use the same resources. Scheduling the root first means less chance to get things wrong early on. In fact, if the DAG is strictly a tree, then bottom-up scheduling with Sethi-Ullman numbers easily minimizes registers. Of course, DAG’s aren’t actually tress, so there’s no clear right or wrong direction to schedule, but bottom-up tends to be pretty effective at minimizing registers if that’s the main constraint. When scheduling for latency, you just need the critical path from each node, which is computed before scheduling begins and doesn’t change. So having more nodes available at first doesn’t hurt. -Andy> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org <mailto:llvm-dev-bounces at lists.llvm.org>] On Behalf Of Matthias Braun via llvm-dev > Sent: November-06-18 3:55 PM > To: Sjoerd Meijer <Sjoerd.Meijer at arm.com <mailto:Sjoerd.Meijer at arm.com>> > Cc: llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > Subject: Re: [llvm-dev] top-down vs. bottom-up list scheduling > > Some comments: > - It's all heuristics, so you may be lucky/unlycky in any configuration. I assume you know how to read -debug-only=machine-scheduler output and the GenericScheduler::tryCandidate() code to figure out how most decisions are made. > - In my experience bottom-up works better when optimizing for register pressure while top-down tends to work better when optimizing for latencies/stall reduction. > - The scheduler has a couple rules dealing with pressure. Hitting any of these limits will make it pick pressure reducing nodes more aggressively (look for tryPressure() in the tryCandidate() code) > - Hitting the target limits (number of available registers) > - Hitting the maximum of the region before scheduling > - Generally attempt to minimmize pressure (this has lower priority though) > > In the past I looked at some crypto code that also was carefully tuned by a human and that turned out to be very tricky for the heuristics to get right because most values had multiple users. > At the time I had a patch that would revert all scheduling decisions if it turned out that the scheduler couldn't find a solution with better or equal register pressure. It wasn't received too enthusiastically and I ended up tweaking some heuristics instead. But we may start that discussion again if you cannot find a know to tune the heuristic for your case. > > - Matthias > > > > > Hello List! > > > > I am looking at top-down vs. bottom-up list scheduling for simple(r) > > in-order cores. First, for some context, below is a fairly > > representative pseudo-code example of the sort of DSP-like codes I am looking at: > > > > uint64_t foo(int *pA, int *pB, unsigned N, unsigned C) { > > uint64_t sum = 0; > > while (N-- > 0) { > > A1 = *pA++; > > A2 = *pA++; > > B1 = *pB++; > > B2 = *pB++; > > ... > > sum += ((uint64_t) A1 * B1) >> C; > > sum += ((uint64_t) A2 * B2) >> C; > > ... > > } > > return sum; > > } > > > > These kernels are very tight loops. In this sort of legacy codes it's > > not uncommon that loops are manually tuned and unrolled with the > > available registers in mind, so that all values just about fit in > > registers. In the example above, we read from input streams A and B, > > and do 2 multiply-accumulate operations. The loop is unrolled 2 times > > in this example, but 4, 8 or 16 would more realistic, resulting in very high register pressure. > > > > But legacy apps written in this way are not the only culprit; we see > > that with > > (aggressive) loop unrolling we can end up in exactly the same > > situation: the loopunroller does exactly what it needs to do, but it > > results in very high register pressure. > > > > Here's the (obvious) problem: all live values in these loops should > > just about fit in registers, but suboptimal/wrong decisions are made > > resulting in a lot of spills/reloads; the machine scheduler is making > > the life of the register allocator very difficult. I am looking at > > regressions of about 30 - 40% for more than a handful of kernels, and > > am thus very interested in what I could do about this. > > > > The first observation is that it looks like we default to bottom-up scheduling. > > Starting bottom-up, all these "sum" variables are scheduled first, and > > after that the loads (this is simplifying things a bit). And thus it > > looks like we create a lot of very long live-ranges, causing the problems mentioned above. > > When instead we start top-down, the loads are picked up first, then > > the MAC, and this repeated for all MACs. The result is a sequence LD, > > MAC, LD, MAC, etc., which is let's say a sequence more in > > program-order, also with shorter live-ranges. This does exactly what I > > want for these sort of kernels, it generates exactly the code we want. > > > > While playing with this bottom-up/top-down order, it didn't take long > > to see that the top-down approach is excellent for these kind of > > codes, but not for some other benchmarks, and so solving this issue > > not just a matter of defaulting to a new scheduling policy. > > > > I am thinking about prototyping an approach where we start with the > > bottom-up approach (under a new misched option), but when a > > register-pressure/live-range trashhold value is reached, we bail and fall back to the top-down approach. > > This is my first rough idea, but I haven't looked at this problem for > > very long. My first few data points is suggesting this might be > > benificial, but I could be missing a lot here. And also I am sure that > > I am looking at an old/classic problem here and I'm sure others have > > looked at this problem before. Thus I am wondering if people have experiences/opinions on this. > > > > Cheers, > > Sjoerd. > > IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20181114/0a33594b/attachment.html>