Quentin Colombet via llvm-dev
2017-Jul-25 01:24 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hi River,

Thanks for the detailed explanation.

If people are okay for you to move forward, like I said to Andrey, I won’t oppose. I feel sad we have to split our effort on outlining technology, but I certainly don’t pretend to know what is best!
The bottom line is if people are happy with that going in, the conversation on the details can continue in parallel.

> On Jul 24, 2017, at 4:56 PM, River Riddle <riddleriver at gmail.com> wrote:
>
> Hey Quentin,
> Sorry I missed the last question. Currently it doesn't do continual outlining, but it's definitely a possibility.

Ok, that means we probably won’t have the very bad runtime problems I had in mind with adding a lot of indirections.

> Thanks,
> River Riddle
>
> On Mon, Jul 24, 2017 at 4:36 PM, River Riddle <riddleriver at gmail.com> wrote:
> Hi Quentin,
> I understand your points and I believe that some meaning is being lost via email. For performance it's true that that cost isn't necessarily modeled; there is currently only support for using profile data to mitigate that. I was working under the assumption, possibly incorrectly, that at Oz we favor small code over anything else, including runtime performance. This is why we only run at Os if we have profile data, and even then we are very conservative about what we outline from.
> I also understand that some target hooks may be required for certain things; it happens for other IR level passes as well. I just want to minimize that behavior as much as I can, though that's a personal choice.
> As for a motivating reason to push for this, it actually solves some of the problems that you brought up in the post for the original Machine Outliner RFC. If I can quote you:
>
> "So basically, better cost model. That's one part of the story.
>
> The other part is at the LLVM IR level or before register allocation identifying similar code sequences is much harder, at least with a suffix tree like algorithm. Basically the problem is how do we name our instructions such that we can match them.
> Let me take an example.
> foo() {
>   /* bunch of code */
>   a = add b, c;
>   d = add e, f;
> }
>
> bar() {
>   d = add e, g;
>   f = add c, w;
> }
>
> With proper renaming, we can outline both adds in one function. The difficulty is to recognize that they are semantically equivalent to give them the same identifier in the suffix tree. I won’t get into the details but it gets tricky quickly. We were thinking of reusing GVN to have such an identifier if we wanted to do the outlining at the IR level but solving this problem is hard."
>
> This outliner will catch your case; it is actually one of the easiest cases for it to catch. The outliner can recognize semantically equivalent instructions and can be expanded to catch even more.

Interesting, could you explain how you do that?
I didn’t see it in the original post.

> As for the cost model, it is quite simple:
> - We identify all of the external inputs into the sequence. For estimating the benefit we constant fold and condense the inputs so that we can get the set of *unique* inputs into the sequence.

Ok, those are your parameters. How do you account for the cost of setting up those parameters?

> - We also identify outputs from the sequence, instructions that have external references. We add the cost of storing/loading/passing the output parameter to the outlined function.

Ok, those are your return values (or your output parameter). I see the cost computation you do on those, but it still misses the general parameter setup cost.
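A rough sketch of the input/output bookkeeping River describes above (unique external inputs become parameters, internally defined values with external uses become outputs); the struct and helper names are illustrative assumptions, not the actual patch:

// Sketch only: values defined outside the candidate sequence are inputs
// (parameters), values defined inside but used outside are outputs.
// Constants are folded out of the input set, and repeated uses of the same
// value are condensed by the SetVector.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/Constant.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"

using namespace llvm;

struct CandidateIO {
  SetVector<Value *> Inputs;        // unique external inputs -> parameters
  SetVector<Instruction *> Outputs; // internally defined, externally used
};

static CandidateIO analyzeCandidate(ArrayRef<Instruction *> Seq) {
  SmallPtrSet<Instruction *, 16> InSeq(Seq.begin(), Seq.end());
  CandidateIO IO;
  for (Instruction *I : Seq) {
    // Operands not produced by the sequence (and not constants) must be
    // passed in somehow at every call site.
    for (Value *Op : I->operands())
      if (!isa<Constant>(Op) &&
          !(isa<Instruction>(Op) && InSeq.count(cast<Instruction>(Op))))
        IO.Inputs.insert(Op);
    // Any use outside the sequence means the value must escape, either as an
    // output parameter or as the promoted return value.
    for (User *U : I->users())
      if (auto *UI = dyn_cast<Instruction>(U))
        if (!InSeq.count(UI))
          IO.Outputs.insert(I);
  }
  return IO;
}

Charging a per-element setup cost against Inputs.size() and Outputs.size() is exactly where Quentin's parameter-setup concern enters the model.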
> - We identify one output to promote to a return value for the function. This can end up reducing an output parameter that is passed to our outlined function.

How do you choose that one?

> - We estimate the cost of the sequence of instructions by mostly using TTI.

Sounds sensible. (Although I am not a fan of the whole TTI thing :)).

> - With the above we can estimate the cost per occurrence as well as the cost for the new function. Of course these are mostly estimates, we lean towards the conservative side to be safe.

Conservative in what sense? Put differently, how do you know your cost is conservative?

> - Finally we can compute an estimated benefit for the sequence taking into account benefit per occurrence and the estimated cost of the new function.

Makes sense.

> There is definitely room for improvement in the cost model. I do not believe it's the best, but it's shown to be somewhat reliable for most cases. It has benefits and it has regressions, as does the machine outliner.

Regressions in what sense?
Do you actually have functions that are bigger?

To clarify, AFAIR, the machine outliner does not have such regressions per se. The outliner does perfectly what the cost model predicted: functions are individually smaller. Code size growth may happen because of padding in the object file (and getting in the way of some linker optimization).

> The reasoning for adding the new framework is because I believe that this transformation should exist at the IR level. Not only because it is the simplest place to put it but also because it provides the greatest opportunities for improvement with the largest coverage.

Again I don’t get those arguments. Machine passes can be target independent. Sure they may require the targets to adopt new hooks but that’s about it, and I don’t see how this is different from having to adopt hooks from TTI. I am guessing you can mostly achieve your goals with the existing TTI hooks, but the argument passing ones, which is the part I missed in your explanation :).

> We can expand the framework to catch Danny's cases from the machine outliner RFC, we can add region outlining, etc. We can do all of this and make it available to every target automatically. The reason why I am for this being at the IR level is not because I want to split up the technical effort, but combine it.

I fail to see the combining part here :).

To give you a bit of context, I am afraid that, indeed, the implementation is going to be simpler (partly because a lot of people are more familiar with LLVM IR than MachineInstr) and for the most part TTI is going to be good enough. However, when we will want to go the extra mile and do crazy stuff, we will, like with the vectorizer IMHO, hit that wall of TTI abstractions that we are going to work around in various ways. I’d like, if at all possible, to avoid repeating the history here.

Cheers,
-Quentin
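One way to combine the quantities from the exchange above into a single decision; the field names and the flat per-input/per-output costs are assumptions for illustration, not the patch's actual model:

// Illustrative only: how the per-occurrence and one-time costs discussed
// above might be folded into an estimated benefit.
struct OutlineCostModel {
  unsigned SeqCost;       // TTI-style cost of one occurrence of the sequence
  unsigned Occurrences;   // how many times the sequence repeats
  unsigned NumInputs;     // unique external inputs (parameters)
  unsigned NumOutputs;    // externally used values (output params / return)
  unsigned CallOverhead;  // cost of the call left behind at each occurrence
  unsigned FrameOverhead; // prologue/epilogue/ret of the new function

  // Paid once: the outlined body, its frame, and the stores/loads needed to
  // communicate outputs back to the callers.
  unsigned oneTimeCost() const { return SeqCost + FrameOverhead + NumOutputs; }

  // Paid at every occurrence: the call itself plus setting up parameters
  // (the part Quentin points out is easy to underestimate).
  unsigned perOccurrenceCost() const { return CallOverhead + NumInputs; }

  // Each occurrence shrinks from SeqCost to the call sequence; the new
  // function is pure overhead. Small or negative values get filtered out by
  // the minimum thresholds mentioned later in the thread.
  int estimatedBenefit() const {
    int Saved = int(Occurrences) * (int(SeqCost) - int(perOccurrenceCost()));
    return Saved - int(oneTimeCost());
  }
};

In this shape, "conservative" simply means rounding the overhead fields up and the per-occurrence saving down.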
> Thanks,
> River Riddle
>
> On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi River,
>
>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at gmail.com> wrote:
>>
>> Hi Quentin,
>> I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e. TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement.
>
> If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :).
>
>> I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters (to the cost model); the actual analysis (heuristic calculation) is based upon TTI information.
>
> I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack, and so on). Those change the code size benefit.
>
>> When you say "Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO." I am slightly confused as to what you mean. I do not intend to try and implement this algorithm at the MIR level given that it exists in Machine Outliner.
>
> Of course, I don’t expect you to do that :). What I meant is that the claim that IR offers the better trade off is not based on hard evidence. I actually don’t buy it.
>
> My point was to make sure I understand what you are trying to solve and, given you have mentioned the MachineOutliner, why you are not working on improving it instead of suggesting a new framework.
> Don’t get me wrong, maybe creating a new framework at the IR level is the right thing to do, but I still didn’t get that from your comments.
>
>> There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible (I can't build clang on Linux with Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.
>
> My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.
>
> Now, to be more concrete on your proposal, could you describe the cost model for deciding what to outline? (Really the cost model, not the suffix algo.)
> Are outlined functions pushed into the list of candidates for further outlining?
>
> Cheers,
> -Quentin
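To illustrate the lowering detail Quentin points at above (an i64 split across registers, later arguments spilled to the stack), some helper of roughly this shape would be needed; none of these names correspond to an existing TTI hook, they are hypothetical:

// Hypothetical helper -- not an existing TTI hook. It only illustrates that
// the same IR-level parameter list can cost very different amounts of code
// depending on how the target's calling convention lowers it.
#include <algorithm>

struct ParamLoweringInfo {
  unsigned NumArgRegs;   // argument registers available (e.g. 4 or 6)
  unsigned RegsForI64;   // 2 on a 32-bit target that splits i64, else 1
  unsigned StackArgCost; // extra cost per argument that ends up on the stack
};

static unsigned estimateArgSetupCost(unsigned NumI32Args, unsigned NumI64Args,
                                     const ParamLoweringInfo &PLI) {
  unsigned RegsNeeded = NumI32Args + NumI64Args * PLI.RegsForI64;
  unsigned InRegs = std::min(RegsNeeded, PLI.NumArgRegs);
  unsigned OnStack = RegsNeeded - InRegs;
  // Assume roughly one move per register argument plus a larger cost for
  // each value that has to be pushed at the call site.
  return InRegs + OnStack * PLI.StackArgCost;
}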
>> Thanks,
>> River Riddle
>>
>> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>> Hi River,
>>
>>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>> Hi Jessica,
>>> The comparison to the inliner is an interesting one but we think it's important to note the difference in the use of heuristics. The inliner is juggling many different tasks at the same time, execution speed, code size, etc., which can cause the parameters to be very sensitive depending on the benchmark/platform/etc. The outliner's heuristics are focused solely on the potential code size savings from outlining, and are thus only sensitive to the current platform. This only creates a problem when we are overestimating the potential cost of a set of instructions for a particular target. The cost model parameters are only minimums: instruction sequence length, estimated benefit, occurrence amount. The heuristics themselves are conservative and based upon all of the target information available at the IR level; the parameters are just setting a lower bound to weed out any outliers. You are correct in that being at the machine level, before or after RA, will give the most accurate heuristics, but we feel there's an advantage to being at the IR level. At the IR level we can do so many more things that are either too difficult/complex for the machine level (e.g. parameterization/outputs/etc.). Not only can we do these things but they are available on all targets immediately, without the need for target hooks. The caution on the use of heuristics is understandable, but there comes a point when trade offs need to be made. We made the trade off for a loss in exact cost modeling to gain flexibility, coverage, and potential for further features. This trade off is the same made for quite a few IR level optimizations, including inlining. As for the worry about code size regressions, so far the results seem to support our hypothesis.
>>
>> Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO. Target hooks are cheap and choosing an implementation because it is simpler might not be the right long term solution.
>> At the very least, to know what trade-off we are making, having prototypes with the different approaches sounds sensible.
>> In particular, all the heuristics about cost for parameter passing (haven’t checked how you did it) sound already complex enough and would require target hooks. Therefore, I am not seeing a clear win with an IR approach here.
>>
>> Finally, having heuristics solely focused on code size do not seem realistic. Indeed, I am guessing you have some thresholds to avoid outlining some piece of code too small that would end up adding a whole lot of indirections, and I don’t like magic numbers in general :).
>>
>> To summarize, I wanted to point out that an IR approach is not as clear a win as you describe and would thus deserve more discussion.
>>
>> Cheers,
>> -Quentin
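The "lower bound" parameters River lists above (minimum sequence length, estimated benefit, occurrence count) amount to a gate of roughly this form; the names and defaults are placeholders rather than the real command-line options:

// Sketch of the user-tunable lower-bound filter that runs before the
// TTI-based analysis. The defaults here are placeholders, not real flags.
struct OutlinerThresholds {
  unsigned MinSequenceLength = 2; // ignore sequences shorter than this
  unsigned MinOccurrences = 2;    // must repeat at least this many times
  int MinBenefit = 1;             // estimated benefit must clear this bar
};

static bool passesLowerBounds(unsigned SeqLen, unsigned Occurrences,
                              int EstimatedBenefit,
                              const OutlinerThresholds &T) {
  return SeqLen >= T.MinSequenceLength && Occurrences >= T.MinOccurrences &&
         EstimatedBenefit >= T.MinBenefit;
}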
>>> Thanks,
>>> River Riddle
>>>
>>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at apple.com> wrote:
>>>
>>> Hi River,
>>>
>>> I’m working on the MachineOutliner pass at the MIR level. Working at the IR level sounds interesting! It also seems like our algorithms are similar. I was thinking of taking the suffix array route with the MachineOutliner in the future.
>>>
>>> Anyway, I’d like to ask about this:
>>>
>>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>
>>>> The downside to having this type of transformation be at the IR level is it means there will be less accuracy in the cost model - we can somewhat accurately model the cost per instruction but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, therefore to help alleviate this there are several tunable parameters for the cost model.
>>>
>>> The inliner is threshold-based and it can be rather unpredictable how it will impact the code size of a program. Do you have any idea as to how heuristics/thresholds/parameters could be tuned to prevent this? In my experience, making good code size decisions with these sorts of passes requires a lot of knowledge about what instructions you’re dealing with exactly. I’ve seen the inliner cause some pretty serious code size regressions in projects due to small changes to the cost model/parameters which cause improvements in other projects. I’m a little worried that an IR-level outliner for code size would be doomed to a similar fate.
>>>
>>> Perhaps it would be interesting to run this sort of pass pre-register allocation? This would help pull you away from having to use heuristics, but give you some more opportunities for finding repeated instruction sequences. I’ve thought of doing something like this in the future with the MachineOutliner and seeing how it goes.
>>>
>>> - Jessica
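Both the suffix-tree approach River describes and the suffix-array route Jessica mentions reduce to finding repeated windows in a string of instruction "letters". A deliberately naive sketch (quadratic suffix sort, no overlap filtering or candidate pruning), for illustration only:

// Naive suffix-array sketch: sort all suffixes of the letter string, then
// look at longest common prefixes of adjacent suffixes to find repeated
// windows. Real implementations use linear-time construction and much more
// careful candidate selection.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Length of the common prefix of the suffixes starting at A and B.
static size_t commonPrefix(const std::vector<unsigned> &S, size_t A, size_t B) {
  size_t Len = 0;
  while (A + Len < S.size() && B + Len < S.size() && S[A + Len] == S[B + Len])
    ++Len;
  return Len;
}

// Returns (start position, length) pairs for windows of at least MinLen
// letters that occur more than once.
static std::vector<std::pair<size_t, size_t>>
findRepeats(const std::vector<unsigned> &S, size_t MinLen) {
  std::vector<size_t> SA(S.size());
  for (size_t I = 0; I < S.size(); ++I)
    SA[I] = I;
  // O(n^2 log n) suffix sort -- fine for a sketch, not for production.
  std::sort(SA.begin(), SA.end(), [&](size_t A, size_t B) {
    return std::lexicographical_compare(S.begin() + A, S.end(),
                                        S.begin() + B, S.end());
  });
  std::vector<std::pair<size_t, size_t>> Repeats;
  for (size_t I = 1; I < SA.size(); ++I) {
    size_t LCP = commonPrefix(S, SA[I - 1], SA[I]);
    if (LCP >= MinLen)
      Repeats.push_back({SA[I], LCP}); // the same window also starts at SA[I-1]
  }
  return Repeats;
}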
River Riddle via llvm-dev
2017-Jul-25 03:31 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hey Quentin,

To clarify some of your concerns/questions:

- Currently parameter/output passing is based around the simple cost = 1, though this is something that will need to change a bit, as you have pointed out. As I said, the cost model isn't perfect.
- The output that is turned into a return is either one that will remove an output parameter from the function call, or the last output in the instruction sequence.
- As for being conservative with the cost, we prefer to overestimate the cost of outlining (new function cost) and underestimate the cost of the instruction sequence when possible. This means that we tend to underestimate the benefit, given that we are working with approximate costs.
- When I speak of regressions I am referring to the size of the text section.
- The gist of identifying semantic equivalence is the very first sentence of Candidate Selection in the initial RFC. We identify equivalence generally via type, opcode, operand types, and any special state. This is just the current definition; we can relax this even further if we want to.
- Working at the IR level has more target independent advantages than just reducing the amount of new hooks. This can be seen in the fact that we don't require -mno-red-zone or any other ABI flag/check, we can easily parameterize sequences, generate outputs from sequences, etc.
- By combining work I mean that adding advancements is much easier. If someone wants to add functionality for outlining regions they don't have to worry about any target specific patchups. It just works, for every target.

Both outliners require hooks to get better performance, but with the IR outliner these hooks are limited to the cost model and not the outlining process itself.

Even though we disagree, I do appreciate the discussion and feedback!

Thanks,
River Riddle
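River's semantic-equivalence bullet above (type, opcode, operand types, any special state) is what lets the two add pairs in Quentin's foo()/bar() example receive the same identifier. A sketch of such a mapping; this is illustrative, not the hashing the patch actually uses:

// Two instructions get the same "letter" for the suffix structure if they
// agree on opcode, result type, and operand types. Operand *values* are
// deliberately left out, which is what lets structurally identical
// sequences with different operands match.
#include <cstddef>
#include "llvm/ADT/Hashing.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

static size_t instructionLetter(const Instruction &I) {
  size_t H = hash_combine(I.getOpcode(), I.getType());
  for (const Value *Op : I.operands())
    H = hash_combine(H, Op->getType()); // operand types only, not identities
  // "Any special state" (e.g. compare predicates, overflow flags,
  // volatility) would also have to be folded in here to stay conservative.
  return H;
}

Because operand identities are not part of the letter, occurrences that differ only in their operands still match, and the differing operands are exactly what gets promoted to parameters of the outlined function.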
Sean Silva via llvm-dev
2017-Jul-25 04:02 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
On Mon, Jul 24, 2017 at 6:24 PM, Quentin Colombet via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> To clarify, AFAIR, the machine outliner does not have such regressions per se. The outliner does perfectly what the cost model predicted: functions are individually smaller. Code size growth may happen because of padding in the object file (and getting in the way of some linker optimization).

The current MIR outliner doesn't take into account instruction encoding length either. Considering that on x86 instructions can commonly be both 1 and 10+ bytes long, the variability from not modeling that is probably comparable to the inaccuracy of estimating a fixed cost for an LLVM IR instruction w.r.t. its lowering to machine code.

As a simple example, many commonly used x86 instructions encode as over 5 bytes, which is the size of a call instruction. So an outlined function that consists of a single instruction can be profitable. IIRC there was about 5% code size saving just from outlining single instructions (that encode at >5 bytes) at the machine code level on the test case I looked at (I mentioned it in a comment on one of Jessica's patches IIRC).

Do we have a way to get an instruction's exact encoded length for the MIR outliner?

-- Sean Silva
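Sean's single-instruction example as back-of-the-envelope arithmetic, assuming the 5-byte near call (E8 + rel32) he alludes to and a 1-byte ret on x86-64; the instruction length is a parameter:

// Outlining N copies of a single instruction that encodes to InstBytes:
//   before: N * InstBytes
//   after:  N * 5 (calls) + InstBytes + 1 (the outlined body plus ret)
static int singleInstSavings(int N, int InstBytes) {
  const int CallBytes = 5, RetBytes = 1;
  int Before = N * InstBytes;
  int After = N * CallBytes + InstBytes + RetBytes;
  return Before - After; // e.g. N=10, InstBytes=7 -> 70 - (50 + 8) = 12 bytes
}

Anything that encodes above 5 bytes and repeats often enough wins, which is why modeling (or not modeling) encoding lengths moves the results noticeably.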
Sean Silva via llvm-dev
2017-Jul-25 04:05 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
On Mon, Jul 24, 2017 at 8:31 PM, River Riddle via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Hey Quentin,
> To clarify some of your concerns/questions:
> - Currently parameter/output passing is based around the simple cost = 1, though this is something that will need to change a bit, as you have pointed out. As I said, the cost model isn't perfect.
> - The output that is turned into a return is either one that will remove an output parameter from the function call, or the last output in the instruction sequence.
> - As for being conservative with the cost, we prefer to overestimate the cost of outlining (new function cost) and underestimate the cost of the instruction sequence when possible. This means that we tend to underestimate the benefit, given that we are working with approximate costs.
> - When I speak of regressions I am referring to the size of the text section.
> - The gist of identifying semantic equivalence is the very first sentence of Candidate Selection in the initial RFC. We identify equivalence generally via type, opcode, operand types, and any special state. This is just the current definition; we can relax this even further if we want to.
> - Working at the IR level has more target independent advantages than just reducing the amount of new hooks. This can be seen in the fact that we don't require -mno-red-zone or any other ABI flag/check, we can easily parameterize sequences, generate outputs from sequences, etc.
> - By combining work I mean that adding advancements is much easier. If someone wants to add functionality for outlining regions they don't have to worry about any target specific patchups. It just works, for every target.

I also think it would be substantially difficult to implement the parameterized outlining at machine level.

Do you have any numbers on how much the benefit comes from parameterization?

-- Sean Silva

> Both outliners require hooks to get better performance, but with the IR outliner these hooks are limited to the cost model and not the outlining process itself.
> Even though we disagree, I do appreciate the discussion and feedback!
>
> Thanks,
> River Riddle
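A sketch of how the parameterization Sean asks about can fall out of the IR-level matching River describes: once occurrences agree on the structural "letters", any operand position whose values differ across occurrences becomes a parameter of the outlined function. This is simplified (it ignores operands that are themselves produced inside the sequence, which would be matched positionally instead) and hypothetical rather than the actual implementation:

#include <utility>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// For a set of structurally identical sequences, return the operand
// positions that differ somewhere, i.e. the positions that must be promoted
// to parameters of the outlined function.
static SmallVector<std::pair<unsigned, unsigned>, 8>
findParameterSlots(ArrayRef<ArrayRef<Instruction *>> Occurrences) {
  SmallVector<std::pair<unsigned, unsigned>, 8> Slots; // (inst idx, op idx)
  ArrayRef<Instruction *> First = Occurrences.front();
  for (unsigned InstIdx = 0; InstIdx < First.size(); ++InstIdx) {
    for (unsigned OpIdx = 0; OpIdx < First[InstIdx]->getNumOperands(); ++OpIdx) {
      Value *Ref = First[InstIdx]->getOperand(OpIdx);
      for (ArrayRef<Instruction *> Occ : Occurrences)
        if (Occ[InstIdx]->getOperand(OpIdx) != Ref) {
          Slots.push_back({InstIdx, OpIdx}); // differs -> becomes a parameter
          break;
        }
    }
  }
  return Slots;
}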
At the IR level we can do so many more >>>>> things that are either too difficult/complex for the machine level(e.g >>>>> parameterization/outputs/etc). Not only can we do these things but they are >>>>> available on all targets immediately, without the need for target hooks. >>>>> The caution on the use of heuristics is understandable, but there comes a >>>>> point when trade offs need to be made. We made the trade off for a loss in >>>>> exact cost modeling to gain flexibility, coverage, and potential for >>>>> further features. This trade off is the same made for quite a few IR level >>>>> optimizations, including inlining. As for the worry about code size >>>>> regressions, so far the results seem to support our hypothesis. >>>>> >>>>> >>>>> Would still be interesting to see how well this could perform on some >>>>> exact model (i.e., at the Machine level), IMO. Target hooks are cheap and >>>>> choosing an implementation because it is simpler might not be the right >>>>> long term solution. >>>>> At the very least, to know what trade-off we are making, having >>>>> prototypes with the different approaches sounds sensible. >>>>> In particular, all the heuristics about cost for parameter passing >>>>> (haven’t checked how you did it) sounds already complex enough and would >>>>> require target hooks. Therefore, I am not seeing a clear win with an IR >>>>> approach here. >>>>> >>>>> Finally, having heuristics solely focused on code size do not seem >>>>> realistic. Indeed, I am guessing you have some thresholds to avoid >>>>> outlining some piece of code too small that would end up adding a whole lot >>>>> of indirections and I don’t like magic numbers in general :). >>>>> >>>>> To summarize, I wanted to point out that an IR approach is not as a >>>>> clear win as you describe and would thus deserve more discussion. >>>>> >>>>> Cheers, >>>>> -Quentin >>>>> >>>>> Thanks, >>>>> River Riddle >>>>> >>>>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette < >>>>> jpaquette at apple.com> wrote: >>>>> >>>>>> >>>>>> Hi River, >>>>>> >>>>>> I’m working on the MachineOutliner pass at the MIR level. Working at >>>>>> the IR level sounds interesting! It also seems like our algorithms are >>>>>> similar. I was thinking of taking the suffix array route with the >>>>>> MachineOutliner in the future. >>>>>> >>>>>> Anyway, I’d like to ask about this: >>>>>> >>>>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev < >>>>>> llvm-dev at lists.llvm.org> wrote: >>>>>> >>>>>> The downside to having this type of transformation be at the IR level >>>>>> is it means there will be less accuracy in the cost model - we can >>>>>> somewhat accurately model the cost per instruction but we can’t get >>>>>> information on how a window of instructions may lower. This can cause >>>>>> regressions depending on the platform/codebase, therefore to help alleviate >>>>>> this there are several tunable parameters for the cost model. >>>>>> >>>>>> >>>>>> The inliner is threshold-based and it can be rather unpredictable how >>>>>> it will impact the code size of a program. Do you have any idea as to how >>>>>> heuristics/thresholds/parameters could be tuned to prevent this? In >>>>>> my experience, making good code size decisions with these sorts of passes >>>>>> requires a lot of knowledge about what instructions you’re dealing with >>>>>> exactly. 
I’ve seen the inliner cause some pretty serious code size >>>>>> regressions in projects due to small changes to the cost model/parameters >>>>>> which cause improvements in other projects. I’m a little worried that an >>>>>> IR-level outliner for code size would be doomed to a similar fate. >>>>>> >>>>>> Perhaps it would be interesting to run this sort of pass pre-register >>>>>> allocation? This would help pull you away from having to use heuristics, >>>>>> but give you some more opportunities for finding repeated instruction >>>>>> sequences. I’ve thought of doing something like this in the future with the >>>>>> MachineOutliner and seeing how it goes. >>>>>> >>>>>> - Jessica >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> LLVM Developers mailing list >>>>> llvm-dev at lists.llvm.org >>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>> >>>>> >>>>> >>>> >>>> >>> >> >> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/99530c30/attachment-0001.html>
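To make the size estimate River describes above a little more concrete, here is a rough sketch of that kind of benefit computation: unique inputs become parameters, outputs are stored/passed back, one output is promoted to the return value, and everything is charged a unit cost. This is only an illustration of the shape of the model; the names, constants and structure are hypothetical, not the proposed pass's actual code.

  // Hypothetical sketch of a size-benefit estimate for an outlining candidate.
  #include <algorithm>
  #include <cstdint>

  struct CandidateEstimate {
    int64_t SequenceCost;    // TTI-style cost of the repeated sequence.
    int64_t NumOccurrences;  // How many times the sequence appears.
    int64_t NumUniqueInputs; // Inputs left after folding/condensing.
    int64_t NumOutputs;      // Values used outside the sequence.
  };

  int64_t estimateBenefit(const CandidateEstimate &C) {
    const int64_t CallCost = 1, ParamCost = 1, OutputCost = 1;
    // One output becomes the return value, so it is not passed as a parameter.
    int64_t OutputParams = std::max<int64_t>(C.NumOutputs - 1, 0);
    int64_t CostPerCall =
        CallCost + ParamCost * C.NumUniqueInputs + OutputCost * OutputParams;
    // The new function carries the sequence plus stores of outputs and a ret.
    int64_t NewFunctionCost = C.SequenceCost + OutputCost * C.NumOutputs + 1;
    int64_t BenefitPerOccurrence = C.SequenceCost - CostPerCall;
    return BenefitPerOccurrence * C.NumOccurrences - NewFunctionCost;
  }

In this framing, Sean's question about parameterization is essentially asking how much of the measured win survives once ParamCost and the call-site setup are charged for every unique input.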
Joerg Sonnenberger via llvm-dev
2017-Jul-25 12:16 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
On Mon, Jul 24, 2017 at 09:02:55PM -0700, Sean Silva via llvm-dev wrote:
> Do we have a way to get an instruction's exact encoded length for the MIR
> outliner?

I don't think we have to go that far. A heuristic based on the immediate
argument should be enough.

Joerg
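As an illustration of what a heuristic "based on the immediate argument" could look like for x86-like encodings, a simple size bucket might be enough. The buckets and byte counts below are assumptions for illustration only, not an existing LLVM hook; real encodings also depend on prefixes, ModRM/SIB and the specific opcode.

  // Illustrative only: estimate encoded bytes of an ALU-style instruction
  // from the width its immediate needs.
  #include <cstdint>

  unsigned estimateEncodedSize(int64_t Imm, bool Has64BitOperand) {
    unsigned Size = 2;               // opcode + ModRM, roughly
    if (Has64BitOperand)
      Size += 1;                     // REX.W prefix
    if (Imm >= -128 && Imm <= 127)
      Size += 1;                     // sign-extended imm8 form
    else
      Size += 4;                     // imm32 form
    return Size;
  }

  // e.g. roughly 4 bytes for "add r64, 0x7f" vs. 7 bytes for "add r64, 0x1000".

The point is only that imm8 vs. imm32 already separates the cheap forms from the expensive ones, which is most of what a size cost model needs.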
Quentin Colombet via llvm-dev
2017-Jul-25 16:21 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Thanks River for the further explanations!> On Jul 24, 2017, at 8:31 PM, River Riddle <riddleriver at gmail.com> wrote: > > Hey Quentin, > To clarify some of your concerns/questions: > - Currently parameter/output passing are based around the simple cost=1, though this is something that will need to change a bit, as you have pointed out. As I said, the cost model isn't perfect. > - The output that is turned into a return is either: one that will remove an output parameter from the function call, or the last output in the instruction sequence. > - As for being conservative with the cost, we prefer to overestimate the cost of outlining(New function cost) and under estimate the cost of the instruction sequence when possible. This means that we tend to under estimate the benefit, given that we are working with approximate costs. > - When I speak of regressions I am referring to the size of the text section. > - The gist of identifying semantic equivalence is the very first sentence of Candidate Selection in the initial RFC. We identify equivalence generally via Type,Opcode, operand types, any special state. This is just the current definition, we can relax this even further if we want to.The thing is with that definition, I fail to see how we can differentiate say add x,x from add x,y. How does that work?> - Working at the IR level has more target independent advantages than just reducing the amount of new hooks. This can be seen in the fact that: we don't require mno-red-zone or any other abi flag/check, we can easily parameterize sequences, generate outputs from sequences, etc. > - By combining work I mean that the ability for adding advancements are much easier. if someone wants to add functionality for outlining regions they don't have to worry about any target specific patchups. It just works, for every target. > Both outliners require hooks to get better performance but with IR outliner these hooks are limited to the cost model and not the outlining process itself. > Even though we disagree, I do appreciate the discussion and feedback! > Thanks, > River Riddle > > On Mon, Jul 24, 2017 at 6:24 PM, Quentin Colombet <qcolombet at apple.com <mailto:qcolombet at apple.com>> wrote: > Hi River, > > Thanks for the detailed explanation. > > If people are okay for you to move forward, like I said to Andrey, I won’t oppose. I feel sad we have to split our effort on outlining technology, but I certainly don’t pretend to know what is best! > The bottom line is if people are happy with that going in, the conversation on the details can continue in parallel. > >> On Jul 24, 2017, at 4:56 PM, River Riddle <riddleriver at gmail.com <mailto:riddleriver at gmail.com>> wrote: >> >> Hey Quentin, >> Sorry I missed the last question. Currently it doesn't do continual outlining, but it's definitely a possibility. > > Ok, that means we probably won’t have the very bad runtime problems I had in mind with adding a lot of indirections. > >> Thanks, >> River Riddle >> >> On Mon, Jul 24, 2017 at 4:36 PM, River Riddle <riddleriver at gmail.com <mailto:riddleriver at gmail.com>> wrote: >> Hi Quentin, >> I understand your points and I believe that some meaning is being lost via email. For performance it's true that that cost isn't necessarily modeled, there is currently only support for using profile data to avoid mitigate that. I was working under the assumption, possibly incorrectly, that at Oz we favor small code over anything else including runtime performance. 
This is why we only run at Os if we have profile data, and even then we are very conservative about what we outline from. >> I also understand that some target hooks may be required for certain things, it happens for other IR level passes as well. I just want to minimize that behavior as much as I can, though thats a personal choice. >> As for a motivating reason to push for this, it actually solves some of the problems that you brought up in the post for the original Machine Outliner RFC. If I can quote you: >> >> "So basically, better cost model. That's one part of the story. >> >> The other part is at the LLVM IR level or before register allocation identifying similar code sequence is much harder, at least with a suffix tree like algorithm. Basically the problem is how do we name our instructions such that we can match them. >> Let me take an example. >> foo() { >> /* bunch of code */ >> a = add b, c; >> d = add e, f; >> } >> >> bar() { >> d = add e, g; >> f = add c, w; >> } >> >> With proper renaming, we can outline both adds in one function. The difficulty is to recognize that they are semantically equivalent to give them the same identifier in the suffix tree. I won’t get into the details but it gets tricky quickly. We were thinking of reusing GVN to have such identifier if we wanted to do the outlining at IR level but solving this problem is hard." >> >> This outliner will catch your case, it is actually one of the easiest cases for it to catch. The outliner can recognize semantically equivalent instructions and can be expanded to catch even more. > > Interesting, could you explain how you do that? > I didn’t see it in the original post. > >> >> As for the cost model it is quite simple: >> - We identify all of the external inputs into the sequence. For estimating the benefit we constant fold and condense the inputs so that we can get the set of *unique* inputs into the sequence. > > Ok, those are your parameters. How do you account for the cost of setting up those parameters? > >> - We also identify outputs from the sequence, instructions that have external references. We add the cost of storing/loading/passing output parameter to the outlined function. > > Ok, those are your return values (or your output parameter). I see the cost computation you do on those, but it still miss the general parameter setup cost. > >> - We identify one output to promote to a return value for the function. This can end up reducing an output parameter that is passed to our outlined function. > > How do you choose that one? > >> - We estimate the cost of the sequence of instructions by mostly using TTI. > > Sounds sensible. (Although I am not a fan of the whole TTI thing :)). > >> - With the above we can estimate the cost per occurrence as well as the cost for the new function. Of course these are mostly estimates, we lean towards the conservative side to be safe. > > Conservative in what sense? Put differently how do you know your cost is conservative? > >> - Finally we can compute an estimated benefit for the sequence taking into account benefit per occurrence and the estimated cost of the new function. > > Make sense. > >> >> There is definitely room for improvement in the cost model. I do not believe its the best but its shown to be somewhat reliable for most cases. It has benefits and it has regressions, as does the machine outliner. > > Regressions in what sense? > Do you actually have functions that are bigger? > > To clarify, AFAIR, the machine outliner does not have such regressions per say. 
The outliner does perfectly what the cost model predicted: functions are individually smaller. Code size grow may happen because of padding in the object file (and getting in the way of some linker optimization). > >> The reasoning for adding the new framework, is because I believe that this transformation should exist at the IR level. Not only because it is the simplest place to put it but also because it provides the greatest opportunities for improvement with the largest coverage. > > Again I don’t get those arguments. Machine passes can be target independent. Sure they may require the targets to adopt new hooks but that’s about it and I don’t see how this is different from having to add adopt hooks from TTI. I am guessing you can mostly achieve your goals with the existing TTI hooks, but the arguments passing ones, which is the part I missed in your explanation :). > >> We can expand the framework to catch Danny's cases from the machine outliner RFC, we can add region outlining, etc. We can do all of this and make it available to every target automatically. The reason why I am for this being at the IR level is not because I want to split up the technical effort, but combine it. > > I fail to see the combining part here :). > > To give you a bit context, I am afraid that, indeed, the implementation is going to be simpler (partly because a lot of people are more familiar with LLVM IR than MachineInstr) and for the most part TTI is going to be good enough. However, when we will want to go the extra mile and do crazy stuff, we will, like with the vectorizer IMHO, hit that wall of TTI abstractions that we are going to work around in various ways. I’d like, if at all possible, to avoid repeating the history here. > > Cheers, > -Quentin > >> >> Thanks, >> River Riddle >> >> >> On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet <qcolombet at apple.com <mailto:qcolombet at apple.com>> wrote: >> Hi River, >> >>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at gmail.com <mailto:riddleriver at gmail.com>> wrote: >>> >>> Hi Quentin, >>> I appreciate the feedback. When I reference the cost of Target Hooks it's mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e TTI for the most part. I can see where you are coming from with "having heuristics solely focused on code size do not seem realistic", but I don't agree with that statement. >> >> If you only want code size I agree it makes sense, but I believe, even in Oz, we probably don’t want to slow the code by a big factor for a couple bytes. That’s what I wanted to say and what I wanted to point out is that you need to have some kind of model for the performance to avoid those worst cases. Unless we don’t care :). >> >>> I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters(to the cost model), the actual analysis(heuristic calculation) is based upon TTI information. >> >> I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack and so on). Those change the code size benefit. >> >>> When you say "Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO." 
I am slightly confused as to what you mean. I do not intend to try and implement this algorithm at the MIR level given that it exists in Machine Outliner. >> >> Of course, I don’t expect you to do that :). What I meant is that the claim that IR offers the better trade off is not based on hard evidences. I actually don’t buy it. >> >> My point was to make sure I understand what you are trying to solve and given you have mentioned the MachineOutliner, why you are not working on improving it instead of suggesting a new framework. >> Don’t take me wrong, maybe creating a new framework at the IR level is the right thing to do, but I still didn’t get that from your comments. >> >>> There are several comparison benchmarks given in the "More detailed performance data" of the original RFC. It includes comparisons to the Machine Outliner when possible(I can't build clang on Linux with Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM. >> >> My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible. >> >> Now, to be more concrete on your proposal, could you describe the cost model for deciding what to outline? (Really the cost model, not the suffix algo.) >> Are outlined functions pushed into the list candidates for further outlining? >> >> Cheers, >> -Quentin >> >>> Thanks, >>> River Riddle >>> >>> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at apple.com <mailto:qcolombet at apple.com>> wrote: >>> Hi River, >>> >>>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: >>>> >>>> Hi Jessica, >>>> The comparison to the inliner is an interesting one but we think it's important to note the difference in the use of heuristics. The inliner is juggling many different tasks at the same time, execution speed, code size, etc. which can cause the parameters to be very sensitive depending on the benchmark/platform/etc. The outliners heuristics are focused solely on the potential code size savings from outlining, and is thus only sensitive to the current platform. This only creates a problem when we are over estimating the potential cost of a set of instructions for a particular target. The cost model parameters are only minimums: instruction sequence length, estimated benefit, occurrence amount. The heuristics themselves are conservative and based upon all of the target information available at the IR level, the parameters are just setting a lower bound to weed out any outliers. You are correct in that being at the machine level, before or after RA, will give the most accurate heuristics but we feel there's an advantage to being at the IR level. At the IR level we can do so many more things that are either too difficult/complex for the machine level(e.g parameterization/outputs/etc). Not only can we do these things but they are available on all targets immediately, without the need for target hooks. The caution on the use of heuristics is understandable, but there comes a point when trade offs need to be made. We made the trade off for a loss in exact cost modeling to gain flexibility, coverage, and potential for further features. This trade off is the same made for quite a few IR level optimizations, including inlining. As for the worry about code size regressions, so far the results seem to support our hypothesis. 
>>> >>> Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO. Target hooks are cheap and choosing an implementation because it is simpler might not be the right long term solution. >>> At the very least, to know what trade-off we are making, having prototypes with the different approaches sounds sensible. >>> In particular, all the heuristics about cost for parameter passing (haven’t checked how you did it) sounds already complex enough and would require target hooks. Therefore, I am not seeing a clear win with an IR approach here. >>> >>> Finally, having heuristics solely focused on code size do not seem realistic. Indeed, I am guessing you have some thresholds to avoid outlining some piece of code too small that would end up adding a whole lot of indirections and I don’t like magic numbers in general :). >>> >>> To summarize, I wanted to point out that an IR approach is not as a clear win as you describe and would thus deserve more discussion. >>> >>> Cheers, >>> -Quentin >>> >>>> Thanks, >>>> River Riddle >>>> >>>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at apple.com <mailto:jpaquette at apple.com>> wrote: >>>> >>>> Hi River, >>>> >>>> I’m working on the MachineOutliner pass at the MIR level. Working at the IR level sounds interesting! It also seems like our algorithms are similar. I was thinking of taking the suffix array route with the MachineOutliner in the future. >>>> >>>> Anyway, I’d like to ask about this: >>>> >>>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: >>>>> >>>>> The downside to having this type of transformation be at the IR level is it means there will be less accuracy in the cost model - we can somewhat accurately model the cost per instruction but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, therefore to help alleviate this there are several tunable parameters for the cost model. >>>> >>>> The inliner is threshold-based and it can be rather unpredictable how it will impact the code size of a program. Do you have any idea as to how heuristics/thresholds/parameters could be tuned to prevent this? In my experience, making good code size decisions with these sorts of passes requires a lot of knowledge about what instructions you’re dealing with exactly. I’ve seen the inliner cause some pretty serious code size regressions in projects due to small changes to the cost model/parameters which cause improvements in other projects. I’m a little worried that an IR-level outliner for code size would be doomed to a similar fate. >>>> >>>> Perhaps it would be interesting to run this sort of pass pre-register allocation? This would help pull you away from having to use heuristics, but give you some more opportunities for finding repeated instruction sequences. I’ve thought of doing something like this in the future with the MachineOutliner and seeing how it goes. >>>> >>>> - Jessica >>>> >>>> >>>> _______________________________________________ >>>> LLVM Developers mailing list >>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> >>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> >>> >>> >> >> >> > >-------------- next part -------------- An HTML attachment was scrubbed... 
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170725/b2a284b7/attachment-0001.html>
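On Quentin's question above about telling "add x, x" apart from "add x, y" when equivalence is keyed on type, opcode and operand types: one way a scheme in that spirit could do it is to also record how operands are reused inside the candidate window, numbering each value by its first occurrence. The sketch below is an illustration of that idea under that assumption, not the proposed pass's implementation; the names are made up.

  // Sketch: a structural key that captures operand reuse, not operand names.
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/Instruction.h"
  #include "llvm/IR/Type.h"
  #include "llvm/IR/Value.h"

  using namespace llvm;

  struct InstKey {
    unsigned Opcode;
    Type *ResultTy;
    SmallVector<Type *, 4> OperandTys;
    SmallVector<unsigned, 4> OperandNumbers; // local numbering of operands
  };

  // LocalNumbering is shared across the whole candidate window.
  InstKey makeKey(const Instruction &I,
                  DenseMap<const Value *, unsigned> &LocalNumbering) {
    InstKey K;
    K.Opcode = I.getOpcode();
    K.ResultTy = I.getType();
    for (const Value *Op : I.operands()) {
      K.OperandTys.push_back(Op->getType());
      // Assign the next local number the first time this value is seen.
      unsigned &Num = LocalNumbering[Op];
      if (Num == 0)
        Num = LocalNumbering.size();
      K.OperandNumbers.push_back(Num);
    }
    return K;
  }

  // With this, both operands of "add %x, %x" get the same local number, while
  // "add %e, %g" gets two distinct numbers, so the two adds hash differently
  // even though opcode and types match; renamed-but-structurally-identical
  // sequences still collide, which is what the suffix algorithm needs.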
Quentin Colombet via llvm-dev
2017-Jul-25 16:26 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hi Sean,> On Jul 24, 2017, at 9:02 PM, Sean Silva <chisophugis at gmail.com> wrote: > > > > On Mon, Jul 24, 2017 at 6:24 PM, Quentin Colombet via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > Hi River, > > Thanks for the detailed explanation. > > If people are okay for you to move forward, like I said to Andrey, I won’t oppose. I feel sad we have to split our effort on outlining technology, but I certainly don’t pretend to know what is best! > The bottom line is if people are happy with that going in, the conversation on the details can continue in parallel. > >> On Jul 24, 2017, at 4:56 PM, River Riddle <riddleriver at gmail.com <mailto:riddleriver at gmail.com>> wrote: >> >> Hey Quentin, >> Sorry I missed the last question. Currently it doesn't do continual outlining, but it's definitely a possibility. > > Ok, that means we probably won’t have the very bad runtime problems I had in mind with adding a lot of indirections. > >> Thanks, >> River Riddle >> >> On Mon, Jul 24, 2017 at 4:36 PM, River Riddle <riddleriver at gmail.com <mailto:riddleriver at gmail.com>> wrote: >> Hi Quentin, >> I understand your points and I believe that some meaning is being lost via email. For performance it's true that that cost isn't necessarily modeled, there is currently only support for using profile data to avoid mitigate that. I was working under the assumption, possibly incorrectly, that at Oz we favor small code over anything else including runtime performance. This is why we only run at Os if we have profile data, and even then we are very conservative about what we outline from. >> I also understand that some target hooks may be required for certain things, it happens for other IR level passes as well. I just want to minimize that behavior as much as I can, though thats a personal choice. >> As for a motivating reason to push for this, it actually solves some of the problems that you brought up in the post for the original Machine Outliner RFC. If I can quote you: >> >> "So basically, better cost model. That's one part of the story. >> >> The other part is at the LLVM IR level or before register allocation identifying similar code sequence is much harder, at least with a suffix tree like algorithm. Basically the problem is how do we name our instructions such that we can match them. >> Let me take an example. >> foo() { >> /* bunch of code */ >> a = add b, c; >> d = add e, f; >> } >> >> bar() { >> d = add e, g; >> f = add c, w; >> } >> >> With proper renaming, we can outline both adds in one function. The difficulty is to recognize that they are semantically equivalent to give them the same identifier in the suffix tree. I won’t get into the details but it gets tricky quickly. We were thinking of reusing GVN to have such identifier if we wanted to do the outlining at IR level but solving this problem is hard." >> >> This outliner will catch your case, it is actually one of the easiest cases for it to catch. The outliner can recognize semantically equivalent instructions and can be expanded to catch even more. > > Interesting, could you explain how you do that? > I didn’t see it in the original post. > >> >> As for the cost model it is quite simple: >> - We identify all of the external inputs into the sequence. For estimating the benefit we constant fold and condense the inputs so that we can get the set of *unique* inputs into the sequence. > > Ok, those are your parameters. How do you account for the cost of setting up those parameters? 
> >> - We also identify outputs from the sequence, instructions that have external references. We add the cost of storing/loading/passing output parameter to the outlined function. > > Ok, those are your return values (or your output parameter). I see the cost computation you do on those, but it still miss the general parameter setup cost. > >> - We identify one output to promote to a return value for the function. This can end up reducing an output parameter that is passed to our outlined function. > > How do you choose that one? > >> - We estimate the cost of the sequence of instructions by mostly using TTI. > > Sounds sensible. (Although I am not a fan of the whole TTI thing :)). > >> - With the above we can estimate the cost per occurrence as well as the cost for the new function. Of course these are mostly estimates, we lean towards the conservative side to be safe. > > Conservative in what sense? Put differently how do you know your cost is conservative? > >> - Finally we can compute an estimated benefit for the sequence taking into account benefit per occurrence and the estimated cost of the new function. > > Make sense. > >> >> There is definitely room for improvement in the cost model. I do not believe its the best but its shown to be somewhat reliable for most cases. It has benefits and it has regressions, as does the machine outliner. > > Regressions in what sense? > Do you actually have functions that are bigger? > > To clarify, AFAIR, the machine outliner does not have such regressions per say. The outliner does perfectly what the cost model predicted: functions are individually smaller. Code size grow may happen because of padding in the object file (and getting in the way of some linker optimization). > > > The current MIR outliner doesn't take into account instruction encoding length either. Considering that on x86 instructions can commonly be both 1 and 10+ bytes long, the variability from not modeling that is probably comparable to the inaccuracy of estimating a fixed cost for an LLVM IR instruction w.r.t. its lowering to machine code.Right, I forgot about x86, AArch64 being the primary target when we did that :). Frankly, there is nothing hard doing the estimate on the actual encoding at this stage of the IR. The reasons why we didn’t do it were: 1. AArch64 does not have differences 2. It takes some (compile) time to compute that information and given we didn’t need it for AArch64 we didn’t do it.> > As a simple example, many commonly used x86 instructions encode as over 5 bytes, which is the size of a call instruction. So an outlined function that consists of a single instruction can be profitable. IIRC there was about 5% code size saving just from outlining single instructions (that encode at >5 bytes) at machine code level on the test case I looked at (I mentined it in a comment on one of Jessica's patches IIRC).Good point!> > Do we have a way to get an instruction's exact encoded length for the MIR outliner?Not right now, IIRC, but I would say that’s half a day effort at most. To be fair, I haven’t followed the development of the code base since Jessica’s internship. Jessica would know the details for sure. Cheers, -Quentin> > -- Sean Silva >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170725/6907d8f4/attachment.html>
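Sean's byte accounting can be made explicit. The sketch below only restates the arithmetic under the usual x86-64 assumptions (5-byte call, 1-byte ret); it is not the MachineOutliner's actual cost model.

  // Back-of-the-envelope savings from outlining a repeated sequence.
  #include <cstdint>

  int64_t outliningSavingsInBytes(int64_t SequenceBytes, int64_t NumOccurrences,
                                  int64_t CallBytes = 5, int64_t RetBytes = 1) {
    int64_t SavedPerSite = SequenceBytes - CallBytes;
    int64_t NewFunctionBytes = SequenceBytes + RetBytes;
    return SavedPerSite * NumOccurrences - NewFunctionBytes;
  }

  // Example: a single 7-byte instruction (common for mem+imm32 forms) repeated
  // 10 times gives 10 * (7 - 5) - (7 + 1) = 12 bytes saved, so even
  // one-instruction candidates can pay off on x86, while anything encoding in
  // 5 bytes or fewer never can.

This is also why the question of exact (or at least bucketed) encoded lengths matters much more on x86 than on a fixed-width target like AArch64.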