Sjoerd Meijer via llvm-dev
2018-Nov-06 14:19 UTC
[llvm-dev] top-down vs. bottom-up list scheduling
Hello List!
I am looking at top-down vs. bottom-up list scheduling for simple(r) in-order
cores. First, for some context, below is a fairly representative pseudo-code
example of the sort of DSP-like codes I am looking at:
uint64_t foo(int *pA, int *pB, unsigned N, unsigned C) {
uint64_t sum = 0;
while (N-- > 0) {
A1 = *pA++;
A2 = *pA++;
B1 = *pB++;
B2 = *pB++;
...
sum += ((uint64_t) A1 * B1) >> C;
sum += ((uint64_t) A2 * B2) >> C;
...
}
return sum;
}
These kernels are very tight loops. In this sort of legacy codes it's not
uncommon that loops are manually tuned and unrolled with the available
registers in mind, so that all values just about fit in registers. In the
example above, we read from input streams A and B, and do 2 multiply-accumulate
operations. The loop is unrolled 2 times in this example, but 4, 8 or 16 would
more realistic, resulting in very high register pressure.
But legacy apps written in this way are not the only culprit; we see that with
(aggressive) loop unrolling we can end up in exactly the same situation: the
loopunroller does exactly what it needs to do, but it results in very high
register pressure.
Here's the (obvious) problem: all live values in these loops should just
about
fit in registers, but suboptimal/wrong decisions are made resulting in a lot of
spills/reloads; the machine scheduler is making the life of the register
allocator
very difficult. I am looking at regressions of about 30 - 40% for more
than a handful of kernels, and am thus very interested in what I could do about
this.
The first observation is that it looks like we default to bottom-up scheduling.
Starting bottom-up, all these "sum" variables are scheduled first, and
after
that the loads (this is simplifying things a bit). And thus it looks like we
create a lot of very long live-ranges, causing the problems mentioned above.
When instead we start top-down, the loads are picked up first, then the MAC, and
this repeated for all MACs. The result is a sequence LD, MAC, LD, MAC, etc.,
which is let's say a sequence more in program-order, also with shorter
live-ranges. This does exactly what I want for these sort of kernels, it
generates
exactly the code we want.
While playing with this bottom-up/top-down order, it didn't take long to see
that the top-down approach is excellent for these kind of codes, but not for
some other benchmarks, and so solving this issue not just a matter of
defaulting to a new scheduling policy.
I am thinking about prototyping an approach where we start with the bottom-up
approach (under a new misched option), but when a register-pressure/live-range
trashhold value is reached, we bail and fall back to the top-down approach.
This is my first rough idea, but I haven't looked at this problem for very
long. My first few data points is suggesting this might be benificial, but I
could be missing a lot here. And also I am sure that I am looking at an
old/classic problem here and I'm sure others have looked at this problem
before. Thus I am wondering if people have experiences/opinions on this.
Cheers,
Sjoerd.
IMPORTANT NOTICE: The contents of this email and any attachments are
confidential and may also be privileged. If you are not the intended recipient,
please notify the sender immediately and do not disclose the contents to any
other person, use it for any purpose, or store or copy the information in any
medium. Thank you.
Friedman, Eli via llvm-dev
2018-Nov-06 19:19 UTC
[llvm-dev] top-down vs. bottom-up list scheduling
On 11/6/2018 6:19 AM, Sjoerd Meijer via llvm-dev wrote:> Hello List! > > I am looking at top-down vs. bottom-up list scheduling for simple(r) in-order > cores. First, for some context, below is a fairly representative pseudo-code > example of the sort of DSP-like codes I am looking at: > > uint64_t foo(int *pA, int *pB, unsigned N, unsigned C) { > uint64_t sum = 0; > while (N-- > 0) { > A1 = *pA++; > A2 = *pA++; > B1 = *pB++; > B2 = *pB++; > ... > sum += ((uint64_t) A1 * B1) >> C; > sum += ((uint64_t) A2 * B2) >> C; > ... > } > return sum; > } > > These kernels are very tight loops. In this sort of legacy codes it's not > uncommon that loops are manually tuned and unrolled with the available > registers in mind, so that all values just about fit in registers. In the > example above, we read from input streams A and B, and do 2 multiply-accumulate > operations. The loop is unrolled 2 times in this example, but 4, 8 or 16 would > more realistic, resulting in very high register pressure. > > But legacy apps written in this way are not the only culprit; we see that with > (aggressive) loop unrolling we can end up in exactly the same situation: the > loopunroller does exactly what it needs to do, but it results in very high > register pressure. > > Here's the (obvious) problem: all live values in these loops should just about > fit in registers, but suboptimal/wrong decisions are made resulting in a lot of > spills/reloads; the machine scheduler is making the life of the register allocator > very difficult. I am looking at regressions of about 30 - 40% for more > than a handful of kernels, and am thus very interested in what I could do about > this. > > The first observation is that it looks like we default to bottom-up scheduling. > Starting bottom-up, all these "sum" variables are scheduled first, and after > that the loads (this is simplifying things a bit). And thus it looks like we > create a lot of very long live-ranges, causing the problems mentioned above. > When instead we start top-down, the loads are picked up first, then the MAC, and > this repeated for all MACs. The result is a sequence LD, MAC, LD, MAC, etc., > which is let's say a sequence more in program-order, also with shorter > live-ranges. This does exactly what I want for these sort of kernels, it generates > exactly the code we want.Do you know why the existing register pressure heuristics don't work for your testcase with the bottom-up scheduler? -Eli -- Employee of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Matthias Braun via llvm-dev
2018-Nov-06 20:55 UTC
[llvm-dev] top-down vs. bottom-up list scheduling
Some comments:
- It's all heuristics, so you may be lucky/unlycky in any configuration. I
assume you know how to read -debug-only=machine-scheduler output and the
GenericScheduler::tryCandidate() code to figure out how most decisions are made.
- In my experience bottom-up works better when optimizing for register pressure
while top-down tends to work better when optimizing for latencies/stall
reduction.
- The scheduler has a couple rules dealing with pressure. Hitting any of these
limits will make it pick pressure reducing nodes more aggressively (look for
tryPressure() in the tryCandidate() code)
- Hitting the target limits (number of available registers)
- Hitting the maximum of the region before scheduling
- Generally attempt to minimmize pressure (this has lower priority though)
In the past I looked at some crypto code that also was carefully tuned by a
human and that turned out to be very tricky for the heuristics to get right
because most values had multiple users.
At the time I had a patch that would revert all scheduling decisions if it
turned out that the scheduler couldn't find a solution with better or equal
register pressure. It wasn't received too enthusiastically and I ended up
tweaking some heuristics instead. But we may start that discussion again if you
cannot find a know to tune the heuristic for your case.
- Matthias
>
> Hello List!
>
> I am looking at top-down vs. bottom-up list scheduling for simple(r)
in-order
> cores. First, for some context, below is a fairly representative
pseudo-code
> example of the sort of DSP-like codes I am looking at:
>
> uint64_t foo(int *pA, int *pB, unsigned N, unsigned C) {
> uint64_t sum = 0;
> while (N-- > 0) {
> A1 = *pA++;
> A2 = *pA++;
> B1 = *pB++;
> B2 = *pB++;
> ...
> sum += ((uint64_t) A1 * B1) >> C;
> sum += ((uint64_t) A2 * B2) >> C;
> ...
> }
> return sum;
> }
>
> These kernels are very tight loops. In this sort of legacy codes it's
not
> uncommon that loops are manually tuned and unrolled with the available
> registers in mind, so that all values just about fit in registers. In the
> example above, we read from input streams A and B, and do 2
multiply-accumulate
> operations. The loop is unrolled 2 times in this example, but 4, 8 or 16
would
> more realistic, resulting in very high register pressure.
>
> But legacy apps written in this way are not the only culprit; we see that
with
> (aggressive) loop unrolling we can end up in exactly the same situation:
the
> loopunroller does exactly what it needs to do, but it results in very high
> register pressure.
>
> Here's the (obvious) problem: all live values in these loops should
just about
> fit in registers, but suboptimal/wrong decisions are made resulting in a
lot of
> spills/reloads; the machine scheduler is making the life of the register
allocator
> very difficult. I am looking at regressions of about 30 - 40% for more
> than a handful of kernels, and am thus very interested in what I could do
about
> this.
>
> The first observation is that it looks like we default to bottom-up
scheduling.
> Starting bottom-up, all these "sum" variables are scheduled
first, and after
> that the loads (this is simplifying things a bit). And thus it looks like
we
> create a lot of very long live-ranges, causing the problems mentioned
above.
> When instead we start top-down, the loads are picked up first, then the
MAC, and
> this repeated for all MACs. The result is a sequence LD, MAC, LD, MAC,
etc.,
> which is let's say a sequence more in program-order, also with shorter
> live-ranges. This does exactly what I want for these sort of kernels, it
generates
> exactly the code we want.
>
> While playing with this bottom-up/top-down order, it didn't take long
to see
> that the top-down approach is excellent for these kind of codes, but not
for
> some other benchmarks, and so solving this issue not just a matter of
> defaulting to a new scheduling policy.
>
> I am thinking about prototyping an approach where we start with the
bottom-up
> approach (under a new misched option), but when a
register-pressure/live-range
> trashhold value is reached, we bail and fall back to the top-down approach.
> This is my first rough idea, but I haven't looked at this problem for
very
> long. My first few data points is suggesting this might be benificial, but
I
> could be missing a lot here. And also I am sure that I am looking at an
> old/classic problem here and I'm sure others have looked at this
problem
> before. Thus I am wondering if people have experiences/opinions on this.
>
> Cheers,
> Sjoerd.
> IMPORTANT NOTICE: The contents of this email and any attachments are
confidential and may also be privileged. If you are not the intended recipient,
please notify the sender immediately and do not disclose the contents to any
other person, use it for any purpose, or store or copy the information in any
medium. Thank you.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Sjoerd Meijer via llvm-dev
2018-Nov-07 13:43 UTC
[llvm-dev] top-down vs. bottom-up list scheduling
Hi Matthias and Eli,
Thanks for your replies. Yes, I am first going to study this more, to really
understand all the different bits and pieces involved here. I indeed need to see
if we not accidentally miss something here, or if we are just unlucky with some
heuristics. Your pointers are really useful for that! I will write again when I
have more concrete ideas/plans.
Cheers,
Sjoerd.
________________________________
From: mbraun at apple.com <mbraun at apple.com> on behalf of Matthias
Braun <mbraun at apple.com>
Sent: 06 November 2018 20:55
To: Sjoerd Meijer
Cc: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] top-down vs. bottom-up list scheduling
Some comments:
- It's all heuristics, so you may be lucky/unlycky in any configuration. I
assume you know how to read -debug-only=machine-scheduler output and the
GenericScheduler::tryCandidate() code to figure out how most decisions are made.
- In my experience bottom-up works better when optimizing for register pressure
while top-down tends to work better when optimizing for latencies/stall
reduction.
- The scheduler has a couple rules dealing with pressure. Hitting any of these
limits will make it pick pressure reducing nodes more aggressively (look for
tryPressure() in the tryCandidate() code)
- Hitting the target limits (number of available registers)
- Hitting the maximum of the region before scheduling
- Generally attempt to minimmize pressure (this has lower priority though)
In the past I looked at some crypto code that also was carefully tuned by a
human and that turned out to be very tricky for the heuristics to get right
because most values had multiple users.
At the time I had a patch that would revert all scheduling decisions if it
turned out that the scheduler couldn't find a solution with better or equal
register pressure. It wasn't received too enthusiastically and I ended up
tweaking some heuristics instead. But we may start that discussion again if you
cannot find a know to tune the heuristic for your case.
- Matthias
>
> Hello List!
>
> I am looking at top-down vs. bottom-up list scheduling for simple(r)
in-order
> cores. First, for some context, below is a fairly representative
pseudo-code
> example of the sort of DSP-like codes I am looking at:
>
> uint64_t foo(int *pA, int *pB, unsigned N, unsigned C) {
> uint64_t sum = 0;
> while (N-- > 0) {
> A1 = *pA++;
> A2 = *pA++;
> B1 = *pB++;
> B2 = *pB++;
> ...
> sum += ((uint64_t) A1 * B1) >> C;
> sum += ((uint64_t) A2 * B2) >> C;
> ...
> }
> return sum;
> }
>
> These kernels are very tight loops. In this sort of legacy codes it's
not
> uncommon that loops are manually tuned and unrolled with the available
> registers in mind, so that all values just about fit in registers. In the
> example above, we read from input streams A and B, and do 2
multiply-accumulate
> operations. The loop is unrolled 2 times in this example, but 4, 8 or 16
would
> more realistic, resulting in very high register pressure.
>
> But legacy apps written in this way are not the only culprit; we see that
with
> (aggressive) loop unrolling we can end up in exactly the same situation:
the
> loopunroller does exactly what it needs to do, but it results in very high
> register pressure.
>
> Here's the (obvious) problem: all live values in these loops should
just about
> fit in registers, but suboptimal/wrong decisions are made resulting in a
lot of
> spills/reloads; the machine scheduler is making the life of the register
allocator
> very difficult. I am looking at regressions of about 30 - 40% for more
> than a handful of kernels, and am thus very interested in what I could do
about
> this.
>
> The first observation is that it looks like we default to bottom-up
scheduling.
> Starting bottom-up, all these "sum" variables are scheduled
first, and after
> that the loads (this is simplifying things a bit). And thus it looks like
we
> create a lot of very long live-ranges, causing the problems mentioned
above.
> When instead we start top-down, the loads are picked up first, then the
MAC, and
> this repeated for all MACs. The result is a sequence LD, MAC, LD, MAC,
etc.,
> which is let's say a sequence more in program-order, also with shorter
> live-ranges. This does exactly what I want for these sort of kernels, it
generates
> exactly the code we want.
>
> While playing with this bottom-up/top-down order, it didn't take long
to see
> that the top-down approach is excellent for these kind of codes, but not
for
> some other benchmarks, and so solving this issue not just a matter of
> defaulting to a new scheduling policy.
>
> I am thinking about prototyping an approach where we start with the
bottom-up
> approach (under a new misched option), but when a
register-pressure/live-range
> trashhold value is reached, we bail and fall back to the top-down approach.
> This is my first rough idea, but I haven't looked at this problem for
very
> long. My first few data points is suggesting this might be benificial, but
I
> could be missing a lot here. And also I am sure that I am looking at an
> old/classic problem here and I'm sure others have looked at this
problem
> before. Thus I am wondering if people have experiences/opinions on this.
>
> Cheers,
> Sjoerd.
> IMPORTANT NOTICE: The contents of this email and any attachments are
confidential and may also be privileged. If you are not the intended recipient,
please notify the sender immediately and do not disclose the contents to any
other person, use it for any purpose, or store or copy the information in any
medium. Thank you.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20181107/5a24a9a7/attachment.html>
Andrew Trick via llvm-dev
2018-Nov-12 23:08 UTC
[llvm-dev] top-down vs. bottom-up list scheduling
> On Nov 6, 2018, at 6:19 AM, Sjoerd Meijer via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Hello List! > > I am looking at top-down vs. bottom-up list scheduling for simple(r) in-order > cores. First, for some context, below is a fairly representative pseudo-code > example of the sort of DSP-like codes I am looking at: > > uint64_t foo(int *pA, int *pB, unsigned N, unsigned C) { > uint64_t sum = 0; > while (N-- > 0) { > A1 = *pA++; > A2 = *pA++; > B1 = *pB++; > B2 = *pB++; > ... > sum += ((uint64_t) A1 * B1) >> C; > sum += ((uint64_t) A2 * B2) >> C; > ... > } > return sum; > }Bottom-up scheduling would work for you in this case if the heuristics knew to minimize pressure at every point in the loop. Ideally, that could be detected from the shape of the DAG before list scheduling begins. The closest thing we have to doing that in the current source is the “subtree” scheduling. See `computeDFSResult`. …or maybe you find a simpler way to control the heurstics! I’m not entirely sure why the top-down heuristics are working better for you in this case. What you’re proposing is similar in spirit to LLVM's bidirectional scheduling. The problem is that a list scheduler can dig itself into a hole in either direction so bidirectional doesn’t really save you unless you throw away the previous schedule and start over. LLVM’s normal no-backtracking approach is really designed to avoid compile-time issues. -Andy