Just wanted to chime in here and say that having a good robust
specialization pass would be very valuable. As you highlight though,
the cost modeling is the tricky bit. Getting a cost model that works
well in practice - in particular, when combined with the existing
inliner cost model - is a major task. I suspect, without much evidence,
that you'll end up having to balance the cost model for
specialization against the cost model for inlining - and thus probably
have to adjust both. That is, to say the least, challenging. If you're
willing to take on that work, I'll simply say thank you and good luck. :)
One thing you could consider is exploring the notion of "zero cost"
specialization. The analogy I'd use is the cost model we use for the
new loop unswitch. The old loop unswitch was pretty aggressive, whereas
the new one requires the transformation to recover (almost all of) the
overhead it introduces in the transformed loop.
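To make the analogy concrete, here is roughly the shape of that
transformation (an illustrative sketch, not actual pass output):

/* Before unswitching: the loop-invariant test runs on every iteration. */
void f(long *a, long n, int flag) {
  for (long i = 0; i < n; ++i) {
    if (flag)
      a[i] += 1;
    else
      a[i] -= 1;
  }
}

/* After unswitching: the test is hoisted and the loop duplicated. The
   "zero cost" framing: the removed per-iteration branch pays for the
   duplicated loop body. */
void f_unswitched(long *a, long n, int flag) {
  if (flag) {
    for (long i = 0; i < n; ++i)
      a[i] += 1;
  } else {
    for (long i = 0; i < n; ++i)
      a[i] -= 1;
  }
}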
I've spent some time in the past thinking about what a zero cost
specializer might look like. I think there are some very interesting
possibilities in looking at it through the lens of function splitting
combined with musttail dispatch. Let me give a small example using one
of your motivating routines. Starting with:
long compute(long x, long (*binop)(long)) {
  return binop(x);
}

int foo(int x, int y) {
  if (y)
    return compute(x, plus);
  else
    return compute(x, minus);
}
This can become (at zero cost):
long compute(long x, long (*binop)(long)) {
  return binop(x);
}

/* Split-out single-target wrappers. */
long computeA(long x) { return compute(x, plus); }
long computeB(long x) { return compute(x, minus); }

int foo(int x, int y) {
  if (y)
    return musttail computeA(x);  /* musttail: pseudo-syntax for a guaranteed tail call */
  else
    return musttail computeB(x);
}
If you think about what that lowers to, foo ends up simply being the
branch dispatch to the other routines. There's some space overhead due
to function alignment and a lost fallthrough opportunity, but that's it.
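For what it's worth, something close to this shape can already be
written by hand with clang's __attribute__((musttail)) - the following
is only a sketch, assuming a clang new enough to support the attribute,
and with signatures adjusted because clang requires the caller's and
callee's prototypes to match:

long plus(long x) { return x + 1; }
long minus(long x) { return x - 1; }

long compute(long x, long (*binop)(long)) {
  return binop(x);
}

/* Split-out single-target entry points. The unused y parameter keeps
   the prototypes identical to foo's, which musttail demands. */
static long computeA(long x, long y) { (void)y; return compute(x, plus); }
static long computeB(long x, long y) { (void)y; return compute(x, minus); }

long foo(long x, long y) {
  if (y)
    __attribute__((musttail)) return computeA(x, y);
  __attribute__((musttail)) return computeB(x, y);
}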
Why bother? Well, on this example, it's hard to justify because the
inliner would happily blow through the whole thing, but on larger
examples we've made the dispatch in foo dramatically easier for the
caller to inline. That's particularly interesting when a subset of
callers have analyzable or predictable (in the branch predictor sense)
values for y.
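As a sketch of that case: if some caller always passes a constant y,
inlining the dispatch-only foo folds the branch away entirely:

int bar(int x) {
  /* Inlining foo (now just a branch over two tail calls) with y == 1
     folds the dispatch, leaving a direct call to computeA. */
  return foo(x, 1);
}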
Anyways, just a random thought dump. If any of this is useful, feel
free to use it; if not, feel free to ignore it. :)
Philip
On 3/23/21 12:44 PM, Sjoerd Meijer via llvm-dev wrote:
> I am interested in adding a function specialisation(*) pass to LLVM.
> This frequently comes up as a reason for performance differences of
> generated code between GCC and LLVM. In GCC this is implemented in [1]
> and gets enabled at optimisation level -O3 and up. There have been two
> previous attempts at adding this to LLVM: a while back this was
> attempted in [2] and very recently in [3].
>
> Both previous attempts were parked at approximately the same point: the
> transformation was implemented but the cost-model to control
> compile-times and code-size was lacking. This is crucial, as the goal
> would be to have function specialisation enabled by default (at some
> opt level), function cloning of course has great potential to increase
> code-size, and the interprocedural analysis isn't very cheap. Thus, I
> agree with previous comments on both tickets that we probably need to
> have a solid story for this before it makes sense to add it to the
> LLVM code-base.
>
> Given that GCC has this enabled by default, we are in an excellent
> position to evaluate the compile-time/code-size overhead of function
> specialisation. I have investigated this overhead for two cases (and
> didn't evaluate performance of the generated code). First, I took
> SQLite's amalgamation version, i.e. 1 source file that is "220,000
> lines long and over 7.5 megabytes in size" [4], as that seemed like a
> good stress test for an IPA transformation pass to me. Comparing GCC
> 10.2.0 with -O3 and "-O3 -fdisable-ipa-cp", to toggle this on and off
> respectively, showed a 0.3% - 1.5% compile-time difference on
> different systems and a 1.3% object file size increase. Second, I
> timed compilation of LLVM, representing a very different codebase. For
> LLVM, compile-time differences were all within noise levels, with only
> a tiny code-size increase. I haven't looked into the details here, but
> my guess is that it is not doing much, which could be the benefit of a
> well-tuned cost-model, so this is a good result.
>
> Another reason why I am widening the audience with this email is that
> alternative approaches were floated. In [2] it was remarked that "we
> could propagate constant sets indicating the functions indirect call
> sites could possibly target. Although we would probably want to limit
> the size of the sets to something small, the pass could attach the
> sets via metadata to the calls so that this information could be
> consumed by later passes. Such metadata could be used for indirect
> call promotion, intersecting the function attributes of the possible
> targets".
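To make that idea concrete: indirect call promotion driven by a small
known target set is essentially the following source-level rewrite (a
sketch only; plus and minus are the routines from the motivating
example below):

long plus(long x);
long minus(long x);

long call_binop(long x, long (*binop)(long)) {
  if (binop == plus)    /* known-possible target: promote... */
    return plus(x);     /* ...to a direct, inlinable call */
  if (binop == minus)
    return minus(x);
  return binop(x);      /* conservative fallback for unknown targets */
}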
>
> But I am reluctant to go for this, let's say, more experimental
> approach as the GCC numbers are very encouraging. I.e., both
> compile-time and code-size seem very reasonable, and I haven't yet
> seen any reasons to abandon the approaches in [2] and [3], which are
> very similar. So, the approach I would like to take is to complement
> [3] with an analysis of the added compile-time/code-size increase, and
> then propose a cost-model which hopefully gives results in the range
> of GCC; I think that is what we should be aiming at. But I am
> interested to hear if there are more/other opinions on this.
>
> (*) Motivating examples for function specialisation in previous works
> were e.g.:
>
> long plus(long x) {
>   return x + 1;
> }
> long minus(long x) {
>   return x - 1;
> }
> long compute(long x, long (*binop)(long)) {
>   return binop(x);
> }
> int foo(int x, int y) {
>   if (y)
>     return compute(x, plus);
>   else
>     return compute(x, minus);
> }
>
> If we specialise compute, which then gets inlined in foo, we end up
> with just a return x + 1 and a return x - 1 in foo. Other motivating
> examples pass constants to functions, which can then get propagated
> after function specialisation.
>
>
> [1] https://github.com/gcc-mirror/gcc/blob/master/gcc/ipa-cp.c
> [2] https://reviews.llvm.org/D36432
> [3] https://reviews.llvm.org/D93838
> [4] https://www.sqlite.org/amalgamation.html