thr3ads.net - llvm dev - [llvm-dev] [RFC] Add IR level interprocedural outliner for code size. [Jul 2017]

If this information is useful, please help other people find it:
Share via:

Sean Silva via llvm-dev

2017-Jul-25 05:25 UTC

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi River,
>
> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at gmail.com>
wrote:
>
> Hi Quentin,
>  I appreciate the feedback. When I reference the cost of Target Hooks
it's
> mainly for maintainability and cost on a target author. We want to keep the
> intrusion into target information minimized. The heuristics used for the
> outliner are the same used by any other IR level pass seeking target
> information, i.e TTI for the most part. I can see where you are coming from
> with "having heuristics solely focused on code size do not seem
> realistic", but I don't agree with that statement.
>
>
> If you only want code size I agree it makes sense, but I believe, even in
> Oz, we probably don’t want to slow the code by a big factor for a couple
> bytes. That’s what I wanted to say and what I wanted to point out is that
> you need to have some kind of model for the performance to avoid those
> worst cases. Unless we don’t care :).
>
> I think there is a disconnect on heuristics. The only user tunable
> parameters are the lower bound parameters(to the cost model), the actual
> analysis(heuristic calculation) is based upon TTI information.
>
>
> I don’t see how you can get around adding more hooks to know how a
> specific function prototype is going to be lowered (e.g., i64 needs to be
> split into two registers, fourth and onward parameters need to be pushed on
> the stack and so on). Those change the code size benefit.
>
>  When you say "Would still be interesting to see how well this could
> perform on some exact model (i.e., at the Machine level), IMO." I am
> slightly confused as to what you mean. I do not intend to try and implement
> this algorithm at the MIR level given that it exists in Machine Outliner.
>
>
> Of course, I don’t expect you to do that :). What I meant is that the
> claim that IR offers the better trade off is not based on hard evidences. I
> actually don’t buy it.
>
> My point was to make sure I understand what you are trying to solve and
> given you have mentioned the MachineOutliner, why you are not working on
> improving it instead of suggesting a new framework.
> Don’t take me wrong, maybe creating a new framework at the IR level is the
> right thing to do, but I still didn’t get that from your comments.
>
> There are several comparison benchmarks given in the "More detailed
> performance data" of the original RFC. It includes comparisons to the
> Machine Outliner when possible(I can't build clang on Linux with
Machine
> Outliner). I welcome any and all discussion on the placement of the
> outliner in LLVM.
>
>
> My fear with a new framework is that we are going to split the effort for
> pushing the outliner technology forward and I’d like to avoid that if at
> all possible.
>
The two passes are pretty different in their approaches to congruency
finding, so I don't think it helps to group them as though they were
interchangeable "outliner technology". The two passes might be totally
orthogonal.

I can imagine two extreme outcomes (reality is probably somewhere in
between):

1. if you run the LLVM IR level code size outliner, then the
MachineOutliner fails to find any significant redundancies.

2. if you run the LLVM IR level code size outliner, then the
MachineOutliner finds just as many redundancies.

It would be good if River could get some data about this.

-- Sean Silva

>
> Now, to be more concrete on your proposal, could you describe the cost
> model for deciding what to outline? (Really the cost model, not the suffix
> algo.)
> Are outlined functions pushed into the list candidates for further
> outlining?
>
> Cheers,
> -Quentin
>
>  Thanks,
> River Riddle
>
> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at
apple.com>
> wrote:
>
>> Hi River,
>>
>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Hi Jessica,
>>  The comparison to the inliner is an interesting one but we think
it's
>> important to note the difference in the use of heuristics. The inliner
is
>> juggling many different tasks at the same time, execution speed, code
size,
>> etc. which can cause the parameters to be very sensitive depending on
the
>> benchmark/platform/etc. The outliners heuristics are focused solely on
the
>> potential code size savings from outlining, and is thus only sensitive
to
>> the current platform. This only creates a problem when we are over
>> estimating the potential cost of a set of instructions for a particular
>> target. The cost model parameters are only minimums: instruction
sequence
>> length, estimated benefit, occurrence amount. The heuristics themselves
are
>> conservative and based upon all of the target information available at
the
>> IR level, the parameters are just setting a lower bound to weed out any
>> outliers. You are correct in that being at the machine level, before or
>> after RA, will give the most accurate heuristics but we feel
there's an
>> advantage to being at the IR level. At the IR level we can do so many
more
>> things that are either too difficult/complex for the machine level(e.g
>> parameterization/outputs/etc). Not only can we do these things but they
are
>> available on all targets immediately, without the need for target
hooks.
>> The caution on the use of heuristics is understandable, but there comes
a
>> point when trade offs need to be made. We made the trade off for a loss
in
>> exact cost modeling to gain flexibility, coverage, and potential for
>> further features. This trade off is the same made for quite a few IR
level
>> optimizations, including inlining. As for the worry about code size
>> regressions, so far the results seem to support our hypothesis.
>>
>>
>> Would still be interesting to see how well this could perform on some
>> exact model (i.e., at the Machine level), IMO. Target hooks are cheap
and
>> choosing an implementation because it is simpler might not be the right
>> long term solution.
>> At the very least, to know what trade-off we are making, having
>> prototypes with the different approaches sounds sensible.
>> In particular, all the heuristics about cost for parameter passing
>> (haven’t checked how you did it) sounds already complex enough and
would
>> require target hooks. Therefore, I am not seeing a clear win with an IR
>> approach here.
>>
>> Finally, having heuristics solely focused on code size do not seem
>> realistic. Indeed, I am guessing you have some thresholds to avoid
>> outlining some piece of code too small that would end up adding a whole
lot
>> of indirections and I don’t like magic numbers in general :).
>>
>> To summarize, I wanted to point out that an IR approach is not as a
clear
>> win as you describe and would thus deserve more discussion.
>>
>> Cheers,
>> -Quentin
>>
>>  Thanks,
>> River Riddle
>>
>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at
apple.com>
>> wrote:
>>
>>>
>>> Hi River,
>>>
>>> I’m working on the MachineOutliner pass at the MIR level. Working
at the
>>> IR level sounds interesting! It also seems like our algorithms are
similar.
>>> I was thinking of taking the suffix array route with the
MachineOutliner in
>>> the future.
>>>
>>> Anyway, I’d like to ask about this:
>>>
>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>> The downside to having this type of transformation be at the IR
level is
>>> it means there will be less accuracy in the cost model -  we can
somewhat
>>> accurately model the cost per instruction but we can’t get
information on
>>> how a window of instructions may lower. This can cause regressions
>>> depending on the platform/codebase, therefore to help alleviate
this there
>>> are several tunable parameters for the cost model.
>>>
>>>
>>> The inliner is threshold-based and it can be rather unpredictable
how it
>>> will impact the code size of a program. Do you have any idea as to
how
>>> heuristics/thresholds/parameters could be tuned to prevent this? In
my
>>> experience, making good code size decisions with these sorts of
passes
>>> requires a lot of knowledge about what instructions you’re dealing
with
>>> exactly. I’ve seen the inliner cause some pretty serious code size
>>> regressions in projects due to small changes to the cost
model/parameters
>>> which cause improvements in other projects. I’m a little worried
that an
>>> IR-level outliner for code size would be doomed to a similar fate.
>>>
>>> Perhaps it would be interesting to run this sort of pass
pre-register
>>> allocation? This would help pull you away from having to use
heuristics,
>>> but give you some more opportunities for finding repeated
instruction
>>> sequences. I’ve thought of doing something like this in the future
with the
>>> MachineOutliner and seeing how it goes.
>>>
>>> - Jessica
>>>
>>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/ff0cdbf3/attachment.html>

Jessica Paquette via llvm-dev

2017-Jul-25 16:24 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

> The two passes are pretty different in their approaches to congruency
finding, so I don't think it helps to group them as though they were
interchangeable "outliner technology". The two passes might be totally
orthogonal.
I think that based off how River described his algorithm, the actual underlying
method should be pretty similar. If it is, then we could probably compare the
performance of the suffix tree vs. suffix array method and then create a general
outliner that can be run at any level of representation. By that I mean that the
actual suffix tree/array candidate search algorithm would be separate from the
actual implementation of *outlining things*. The actual implementation at each
level of representation would define

- How to outline a sequence of instructions
- An equivalence/congruence scheme for two instructions
- A cost model

I don’t think it’d be too difficult to split it up in that sort of way. I’d also
like to experiment with pre-regalloc outlining in general, so it’d make it
easier to explore that route as well.
> 1. if you run the LLVM IR level code size outliner, then the
MachineOutliner fails to find any significant redundancies.
> 
> 2. if you run the LLVM IR level code size outliner, then the
MachineOutliner finds just as many redundancies.
This would be interesting. The MachineOutliner deems two instructions equivalent
iff their opcodes and operands are identical. This isn’t super flexible
(pre-register allocation outlining would be much better). However, an IR-level
outliner has a lot more wiggle-room. It would be interesting to see how certain
equivalence schemes perform. Once again, making assumptions about River’s
algorithm, but the equivalence/congruence schemes are probably where the most
significant differences in the way each technique handles outlining lie.

- Jessica
> On Jul 24, 2017, at 10:25 PM, Sean Silva via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> 
> 
> On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet via llvm-dev <llvm-dev
at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> Hi River,
> 
>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at gmail.com
<mailto:riddleriver at gmail.com>> wrote:
>> 
>> Hi Quentin,
>>  I appreciate the feedback. When I reference the cost of Target Hooks
it's mainly for maintainability and cost on a target author. We want to keep
the intrusion into target information minimized. The heuristics used for the
outliner are the same used by any other IR level pass seeking target
information, i.e TTI for the most part. I can see where you are coming from with
"having heuristics solely focused on code size do not seem realistic",
but I don't agree with that statement.
> 
> If you only want code size I agree it makes sense, but I believe, even in
Oz, we probably don’t want to slow the code by a big factor for a couple bytes.
That’s what I wanted to say and what I wanted to point out is that you need to
have some kind of model for the performance to avoid those worst cases. Unless
we don’t care :).
> 
>> I think there is a disconnect on heuristics. The only user tunable
parameters are the lower bound parameters(to the cost model), the actual
analysis(heuristic calculation) is based upon TTI information.
> 
> I don’t see how you can get around adding more hooks to know how a specific
function prototype is going to be lowered (e.g., i64 needs to be split into two
registers, fourth and onward parameters need to be pushed on the stack and so
on). Those change the code size benefit.
> 
>>  When you say "Would still be interesting to see how well this
could perform on some exact model (i.e., at the Machine level), IMO." I am
slightly confused as to what you mean. I do not intend to try and implement this
algorithm at the MIR level given that it exists in Machine Outliner.
> 
> Of course, I don’t expect you to do that :). What I meant is that the claim
that IR offers the better trade off is not based on hard evidences. I actually
don’t buy it.
> 
> My point was to make sure I understand what you are trying to solve and
given you have mentioned the MachineOutliner, why you are not working on
improving it instead of suggesting a new framework.
> Don’t take me wrong, maybe creating a new framework at the IR level is the
right thing to do, but I still didn’t get that from your comments.
> 
>> There are several comparison benchmarks given in the "More
detailed performance data" of the original RFC. It includes comparisons to
the Machine Outliner when possible(I can't build clang on Linux with Machine
Outliner). I welcome any and all discussion on the placement of the outliner in
LLVM.
> 
> My fear with a new framework is that we are going to split the effort for
pushing the outliner technology forward and I’d like to avoid that if at all
possible.
> 
> The two passes are pretty different in their approaches to congruency
finding, so I don't think it helps to group them as though they were
interchangeable "outliner technology". The two passes might be totally
orthogonal.
> 
> I can imagine two extreme outcomes (reality is probably somewhere in
between):
> 
> 1. if you run the LLVM IR level code size outliner, then the
MachineOutliner fails to find any significant redundancies.
> 
> 2. if you run the LLVM IR level code size outliner, then the
MachineOutliner finds just as many redundancies.
> 
> It would be good if River could get some data about this.
> 
> -- Sean Silva
>  
> 
> Now, to be more concrete on your proposal, could you describe the cost
model for deciding what to outline? (Really the cost model, not the suffix
algo.)
> Are outlined functions pushed into the list candidates for further
outlining?
> 
> Cheers,
> -Quentin
> 
>>  Thanks,
>> River Riddle
>> 
>> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at
apple.com <mailto:qcolombet at apple.com>> wrote:
>> Hi River,
>> 
>>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>> 
>>> Hi Jessica,
>>>  The comparison to the inliner is an interesting one but we think
it's important to note the difference in the use of heuristics. The inliner
is juggling many different tasks at the same time, execution speed, code size,
etc. which can cause the parameters to be very sensitive depending on the
benchmark/platform/etc. The outliners heuristics are focused solely on the
potential code size savings from outlining, and is thus only sensitive to the
current platform. This only creates a problem when we are over estimating the
potential cost of a set of instructions for a particular target. The cost model
parameters are only minimums: instruction sequence length, estimated benefit,
occurrence amount. The heuristics themselves are conservative and based upon all
of the target information available at the IR level, the parameters are just
setting a lower bound to weed out any outliers. You are correct in that being at
the machine level, before or after RA, will give the most accurate heuristics
but we feel there's an advantage to being at the IR level. At the IR level
we can do so many more things that are either too difficult/complex for the
machine level(e.g parameterization/outputs/etc). Not only can we do these things
but they are available on all targets immediately, without the need for target
hooks. The caution on the use of heuristics is understandable, but there comes a
point when trade offs need to be made. We made the trade off for a loss in exact
cost modeling to gain flexibility, coverage, and potential for further features.
This trade off is the same made for quite a few IR level optimizations,
including inlining. As for the worry about code size regressions, so far the
results seem to support our hypothesis.
>> 
>> Would still be interesting to see how well this could perform on some
exact model (i.e., at the Machine level), IMO. Target hooks are cheap and
choosing an implementation because it is simpler might not be the right long
term solution.
>> At the very least, to know what trade-off we are making, having
prototypes with the different approaches sounds sensible.
>> In particular, all the heuristics about cost for parameter passing
(haven’t checked how you did it) sounds already complex enough and would require
target hooks. Therefore, I am not seeing a clear win with an IR approach here.
>> 
>> Finally, having heuristics solely focused on code size do not seem
realistic. Indeed, I am guessing you have some thresholds to avoid outlining
some piece of code too small that would end up adding a whole lot of
indirections and I don’t like magic numbers in general :).
>> 
>> To summarize, I wanted to point out that an IR approach is not as a
clear win as you describe and would thus deserve more discussion.
>> 
>> Cheers,
>> -Quentin
>> 
>>>  Thanks,
>>> River Riddle
>>> 
>>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at
apple.com <mailto:jpaquette at apple.com>> wrote:
>>> 
>>> Hi River,
>>> 
>>> I’m working on the MachineOutliner pass at the MIR level. Working
at the IR level sounds interesting! It also seems like our algorithms are
similar. I was thinking of taking the suffix array route with the
MachineOutliner in the future.
>>> 
>>> Anyway, I’d like to ask about this:
>>> 
>>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>>> 
>>>> The downside to having this type of transformation be at the IR
level is it means there will be less accuracy in the cost model -  we can
somewhat accurately model the cost per instruction but we can’t get information
on how a window of instructions may lower. This can cause regressions depending
on the platform/codebase, therefore to help alleviate this there are several
tunable parameters for the cost model.
>>> 
>>> The inliner is threshold-based and it can be rather unpredictable
how it will impact the code size of a program. Do you have any idea as to how
heuristics/thresholds/parameters could be tuned to prevent this? In my
experience, making good code size decisions with these sorts of passes requires
a lot of knowledge about what instructions you’re dealing with exactly. I’ve
seen the inliner cause some pretty serious code size regressions in projects due
to small changes to the cost model/parameters which cause improvements in other
projects. I’m a little worried that an IR-level outliner for code size would be
doomed to a similar fate.
>>> 
>>> Perhaps it would be interesting to run this sort of pass
pre-register allocation? This would help pull you away from having to use
heuristics, but give you some more opportunities for finding repeated
instruction sequences. I’ve thought of doing something like this in the future
with the MachineOutliner and seeing how it goes.
>>> 
>>> - Jessica
>>> 
>>> 
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>> 
>> 
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170725/f8599be6/attachment.html>

Quentin Colombet via llvm-dev

2017-Jul-25 16:31 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

> On Jul 25, 2017, at 9:24 AM, Jessica Paquette <jpaquette at
apple.com> wrote:
> 
>> The two passes are pretty different in their approaches to congruency
finding, so I don't think it helps to group them as though they were
interchangeable "outliner technology". The two passes might be totally
orthogonal.
> 
> I think that based off how River described his algorithm, the actual
underlying method should be pretty similar. If it is, then we could probably
compare the performance of the suffix tree vs. suffix array method and then
create a general outliner that can be run at any level of representation.
+1, I believe that would be the ideal outcome.
I can see a split like this:
- Outliner algorithm IR-agnostic (with possibly different mode: suffix tree,
array.)
- Pluggable cost model
- Rewriter
> By that I mean that the actual suffix tree/array candidate search algorithm
would be separate from the actual implementation of *outlining things*. The
actual implementation at each level of representation would define
> 
> - How to outline a sequence of instructions
> - An equivalence/congruence scheme for two instructions
> - A cost model
> 
> I don’t think it’d be too difficult to split it up in that sort of way. I’d
also like to experiment with pre-regalloc outlining in general, so it’d make it
easier to explore that route as well.
> 
>> 1. if you run the LLVM IR level code size outliner, then the
MachineOutliner fails to find any significant redundancies.
>> 
>> 2. if you run the LLVM IR level code size outliner, then the
MachineOutliner finds just as many redundancies.
> 
> This would be interesting. The MachineOutliner deems two instructions
equivalent iff their opcodes and operands are identical. This isn’t super
flexible (pre-register allocation outlining would be much better). However, an
IR-level outliner has a lot more wiggle-room. It would be interesting to see how
certain equivalence schemes perform. Once again, making assumptions about
River’s algorithm, but the equivalence/congruence schemes are probably where the
most significant differences in the way each technique handles outlining lie.
> 
> - Jessica
> 
>> On Jul 24, 2017, at 10:25 PM, Sean Silva via llvm-dev <llvm-dev at
lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>> 
>> 
>> 
>> On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>> Hi River,
>> 
>>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at
gmail.com <mailto:riddleriver at gmail.com>> wrote:
>>> 
>>> Hi Quentin,
>>>  I appreciate the feedback. When I reference the cost of Target
Hooks it's mainly for maintainability and cost on a target author. We want
to keep the intrusion into target information minimized. The heuristics used for
the outliner are the same used by any other IR level pass seeking target
information, i.e TTI for the most part. I can see where you are coming from with
"having heuristics solely focused on code size do not seem realistic",
but I don't agree with that statement.
>> 
>> If you only want code size I agree it makes sense, but I believe, even
in Oz, we probably don’t want to slow the code by a big factor for a couple
bytes. That’s what I wanted to say and what I wanted to point out is that you
need to have some kind of model for the performance to avoid those worst cases.
Unless we don’t care :).
>> 
>>> I think there is a disconnect on heuristics. The only user tunable
parameters are the lower bound parameters(to the cost model), the actual
analysis(heuristic calculation) is based upon TTI information.
>> 
>> I don’t see how you can get around adding more hooks to know how a
specific function prototype is going to be lowered (e.g., i64 needs to be split
into two registers, fourth and onward parameters need to be pushed on the stack
and so on). Those change the code size benefit.
>> 
>>>  When you say "Would still be interesting to see how well this
could perform on some exact model (i.e., at the Machine level), IMO." I am
slightly confused as to what you mean. I do not intend to try and implement this
algorithm at the MIR level given that it exists in Machine Outliner.
>> 
>> Of course, I don’t expect you to do that :). What I meant is that the
claim that IR offers the better trade off is not based on hard evidences. I
actually don’t buy it.
>> 
>> My point was to make sure I understand what you are trying to solve and
given you have mentioned the MachineOutliner, why you are not working on
improving it instead of suggesting a new framework.
>> Don’t take me wrong, maybe creating a new framework at the IR level is
the right thing to do, but I still didn’t get that from your comments.
>> 
>>> There are several comparison benchmarks given in the "More
detailed performance data" of the original RFC. It includes comparisons to
the Machine Outliner when possible(I can't build clang on Linux with Machine
Outliner). I welcome any and all discussion on the placement of the outliner in
LLVM.
>> 
>> My fear with a new framework is that we are going to split the effort
for pushing the outliner technology forward and I’d like to avoid that if at all
possible.
>> 
>> The two passes are pretty different in their approaches to congruency
finding, so I don't think it helps to group them as though they were
interchangeable "outliner technology". The two passes might be totally
orthogonal.
>> 
>> I can imagine two extreme outcomes (reality is probably somewhere in
between):
>> 
>> 1. if you run the LLVM IR level code size outliner, then the
MachineOutliner fails to find any significant redundancies.
>> 
>> 2. if you run the LLVM IR level code size outliner, then the
MachineOutliner finds just as many redundancies.
>> 
>> It would be good if River could get some data about this.
>> 
>> -- Sean Silva
>>  
>> 
>> Now, to be more concrete on your proposal, could you describe the cost
model for deciding what to outline? (Really the cost model, not the suffix
algo.)
>> Are outlined functions pushed into the list candidates for further
outlining?
>> 
>> Cheers,
>> -Quentin
>> 
>>>  Thanks,
>>> River Riddle
>>> 
>>> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at
apple.com <mailto:qcolombet at apple.com>> wrote:
>>> Hi River,
>>> 
>>>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>>> 
>>>> Hi Jessica,
>>>>  The comparison to the inliner is an interesting one but we
think it's important to note the difference in the use of heuristics. The
inliner is juggling many different tasks at the same time, execution speed, code
size, etc. which can cause the parameters to be very sensitive depending on the
benchmark/platform/etc. The outliners heuristics are focused solely on the
potential code size savings from outlining, and is thus only sensitive to the
current platform. This only creates a problem when we are over estimating the
potential cost of a set of instructions for a particular target. The cost model
parameters are only minimums: instruction sequence length, estimated benefit,
occurrence amount. The heuristics themselves are conservative and based upon all
of the target information available at the IR level, the parameters are just
setting a lower bound to weed out any outliers. You are correct in that being at
the machine level, before or after RA, will give the most accurate heuristics
but we feel there's an advantage to being at the IR level. At the IR level
we can do so many more things that are either too difficult/complex for the
machine level(e.g parameterization/outputs/etc). Not only can we do these things
but they are available on all targets immediately, without the need for target
hooks. The caution on the use of heuristics is understandable, but there comes a
point when trade offs need to be made. We made the trade off for a loss in exact
cost modeling to gain flexibility, coverage, and potential for further features.
This trade off is the same made for quite a few IR level optimizations,
including inlining. As for the worry about code size regressions, so far the
results seem to support our hypothesis.
>>> 
>>> Would still be interesting to see how well this could perform on
some exact model (i.e., at the Machine level), IMO. Target hooks are cheap and
choosing an implementation because it is simpler might not be the right long
term solution.
>>> At the very least, to know what trade-off we are making, having
prototypes with the different approaches sounds sensible.
>>> In particular, all the heuristics about cost for parameter passing
(haven’t checked how you did it) sounds already complex enough and would require
target hooks. Therefore, I am not seeing a clear win with an IR approach here.
>>> 
>>> Finally, having heuristics solely focused on code size do not seem
realistic. Indeed, I am guessing you have some thresholds to avoid outlining
some piece of code too small that would end up adding a whole lot of
indirections and I don’t like magic numbers in general :).
>>> 
>>> To summarize, I wanted to point out that an IR approach is not as a
clear win as you describe and would thus deserve more discussion.
>>> 
>>> Cheers,
>>> -Quentin
>>> 
>>>>  Thanks,
>>>> River Riddle
>>>> 
>>>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette
<jpaquette at apple.com <mailto:jpaquette at apple.com>> wrote:
>>>> 
>>>> Hi River,
>>>> 
>>>> I’m working on the MachineOutliner pass at the MIR level.
Working at the IR level sounds interesting! It also seems like our algorithms
are similar. I was thinking of taking the suffix array route with the
MachineOutliner in the future.
>>>> 
>>>> Anyway, I’d like to ask about this:
>>>> 
>>>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>>>> 
>>>>> The downside to having this type of transformation be at
the IR level is it means there will be less accuracy in the cost model -  we can
somewhat accurately model the cost per instruction but we can’t get information
on how a window of instructions may lower. This can cause regressions depending
on the platform/codebase, therefore to help alleviate this there are several
tunable parameters for the cost model.
>>>> 
>>>> The inliner is threshold-based and it can be rather
unpredictable how it will impact the code size of a program. Do you have any
idea as to how heuristics/thresholds/parameters could be tuned to prevent this?
In my experience, making good code size decisions with these sorts of passes
requires a lot of knowledge about what instructions you’re dealing with exactly.
I’ve seen the inliner cause some pretty serious code size regressions in
projects due to small changes to the cost model/parameters which cause
improvements in other projects. I’m a little worried that an IR-level outliner
for code size would be doomed to a similar fate.
>>>> 
>>>> Perhaps it would be interesting to run this sort of pass
pre-register allocation? This would help pull you away from having to use
heuristics, but give you some more opportunities for finding repeated
instruction sequences. I’ve thought of doing something like this in the future
with the MachineOutliner and seeing how it goes.
>>>> 
>>>> - Jessica
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>>> 
>>> 
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170725/da446580/attachment-0001.html>

Sean Silva via llvm-dev

2017-Jul-25 22:32 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

On Tue, Jul 25, 2017 at 9:24 AM, Jessica Paquette <jpaquette at apple.com>
wrote:
> The two passes are pretty different in their approaches to congruency
> finding, so I don't think it helps to group them as though they were
> interchangeable "outliner technology". The two passes might be
totally
> orthogonal.
>
>
> I think that based off how River described his algorithm, the actual
> underlying method should be pretty similar. If it is, then we could
> probably compare the performance of the suffix tree vs. suffix array method
> and then create a general outliner that can be run at any level of
> representation. By that I mean that the actual suffix tree/array candidate
> search algorithm would be separate from the actual implementation of
> *outlining things*. The actual implementation at each level of
> representation would define
>
> - How to outline a sequence of instructions
> - An equivalence/congruence scheme for two instructions
> - A cost model
>
> I don’t think it’d be too difficult to split it up in that sort of way.
> I’d also like to experiment with pre-regalloc outlining in general, so it’d
> make it easier to explore that route as well.
>
There is some commonality, but the suffix tree/array candidate selection is
only the first step in the IR-level outliner in River's patch, and the most
interesting things it does happen after that. The result of the suffix
tree/array candidate selection in River's patch is a set of sequences that
are not guaranteed to be identical because relaxed congruence is used, and
the code has to determine if they are in fact identical and parameterize
any parameterizable differences.

This RFC/review should probably acknowledge the similarities with the
MachineOutliner but actually focus more on the differences. I think we all
agree that just running the current MachineOutliner's algorithm at IR level
is not interesting. And indeed, River's patch isn't that.


There is definitely commonality and potentially reusability for the
"repeated substring" aspect of candidate selection. It probably
wouldn't be
too difficult to factor out the MachineOutliner's suffix tree that operates
on a simple std::vector<unsigned> and reuse that in River's patch,
avoiding
the need for the suffix array implementation (or vice versa, though it's
probably best to move code around in tree at first than introduce new
code). The entire function findOutliningOccurrences in River's patch smells
a lot like MachineOutliner::buildCandidateList+pruneOverlaps. It would
probably make sense to common at least the suffix tree/array in a first
patch, and possibly the overlap pruning too. That would decrease the line
count of CodeSizeOutliner.cpp in the patch and emphasize the differences
from the MachineOutliner.

-- Sean Silva

> 1. if you run the LLVM IR level code size outliner, then the
> MachineOutliner fails to find any significant redundancies.
>
> 2. if you run the LLVM IR level code size outliner, then the
> MachineOutliner finds just as many redundancies.
>
>
> This would be interesting. The MachineOutliner deems two instructions
> equivalent iff their opcodes and operands are identical. This isn’t super
> flexible (pre-register allocation outlining would be much better). However,
> an IR-level outliner has a lot more wiggle-room. It would be interesting to
> see how certain equivalence schemes perform. Once again, making assumptions
> about River’s algorithm, but the equivalence/congruence schemes are
> probably where the most significant differences in the way each technique
> handles outlining lie.
>
> - Jessica
>
> On Jul 24, 2017, at 10:25 PM, Sean Silva via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>
>
> On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi River,
>>
>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at
gmail.com> wrote:
>>
>> Hi Quentin,
>>  I appreciate the feedback. When I reference the cost of Target Hooks
>> it's mainly for maintainability and cost on a target author. We
want to
>> keep the intrusion into target information minimized. The heuristics
used
>> for the outliner are the same used by any other IR level pass seeking
>> target information, i.e TTI for the most part. I can see where you are
>> coming from with "having heuristics solely focused on code size do
not
>> seem realistic", but I don't agree with that statement.
>>
>>
>> If you only want code size I agree it makes sense, but I believe, even
in
>> Oz, we probably don’t want to slow the code by a big factor for a
couple
>> bytes. That’s what I wanted to say and what I wanted to point out is
that
>> you need to have some kind of model for the performance to avoid those
>> worst cases. Unless we don’t care :).
>>
>> I think there is a disconnect on heuristics. The only user tunable
>> parameters are the lower bound parameters(to the cost model), the
actual
>> analysis(heuristic calculation) is based upon TTI information.
>>
>>
>> I don’t see how you can get around adding more hooks to know how a
>> specific function prototype is going to be lowered (e.g., i64 needs to
be
>> split into two registers, fourth and onward parameters need to be
pushed on
>> the stack and so on). Those change the code size benefit.
>>
>>  When you say "Would still be interesting to see how well this
could
>> perform on some exact model (i.e., at the Machine level), IMO." I
am
>> slightly confused as to what you mean. I do not intend to try and
implement
>> this algorithm at the MIR level given that it exists in Machine
Outliner.
>>
>>
>> Of course, I don’t expect you to do that :). What I meant is that the
>> claim that IR offers the better trade off is not based on hard
evidences. I
>> actually don’t buy it.
>>
>> My point was to make sure I understand what you are trying to solve and
>> given you have mentioned the MachineOutliner, why you are not working
on
>> improving it instead of suggesting a new framework.
>> Don’t take me wrong, maybe creating a new framework at the IR level is
>> the right thing to do, but I still didn’t get that from your comments.
>>
>> There are several comparison benchmarks given in the "More
detailed
>> performance data" of the original RFC. It includes comparisons to
the
>> Machine Outliner when possible(I can't build clang on Linux with
Machine
>> Outliner). I welcome any and all discussion on the placement of the
>> outliner in LLVM.
>>
>>
>> My fear with a new framework is that we are going to split the effort
for
>> pushing the outliner technology forward and I’d like to avoid that if
at
>> all possible.
>>
>
> The two passes are pretty different in their approaches to congruency
> finding, so I don't think it helps to group them as though they were
> interchangeable "outliner technology". The two passes might be
totally
> orthogonal.
>
> I can imagine two extreme outcomes (reality is probably somewhere in
> between):
>
> 1. if you run the LLVM IR level code size outliner, then the
> MachineOutliner fails to find any significant redundancies.
>
> 2. if you run the LLVM IR level code size outliner, then the
> MachineOutliner finds just as many redundancies.
>
> It would be good if River could get some data about this.
>
> -- Sean Silva
>
>
>>
>> Now, to be more concrete on your proposal, could you describe the cost
>> model for deciding what to outline? (Really the cost model, not the
suffix
>> algo.)
>> Are outlined functions pushed into the list candidates for further
>> outlining?
>>
>> Cheers,
>> -Quentin
>>
>>  Thanks,
>> River Riddle
>>
>> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at
apple.com>
>> wrote:
>>
>>> Hi River,
>>>
>>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>> Hi Jessica,
>>>  The comparison to the inliner is an interesting one but we think
it's
>>> important to note the difference in the use of heuristics. The
inliner is
>>> juggling many different tasks at the same time, execution speed,
code size,
>>> etc. which can cause the parameters to be very sensitive depending
on the
>>> benchmark/platform/etc. The outliners heuristics are focused solely
on the
>>> potential code size savings from outlining, and is thus only
sensitive to
>>> the current platform. This only creates a problem when we are over
>>> estimating the potential cost of a set of instructions for a
particular
>>> target. The cost model parameters are only minimums: instruction
sequence
>>> length, estimated benefit, occurrence amount. The heuristics
themselves are
>>> conservative and based upon all of the target information available
at the
>>> IR level, the parameters are just setting a lower bound to weed out
any
>>> outliers. You are correct in that being at the machine level,
before or
>>> after RA, will give the most accurate heuristics but we feel
there's an
>>> advantage to being at the IR level. At the IR level we can do so
many more
>>> things that are either too difficult/complex for the machine
level(e.g
>>> parameterization/outputs/etc). Not only can we do these things but
they are
>>> available on all targets immediately, without the need for target
hooks.
>>> The caution on the use of heuristics is understandable, but there
comes a
>>> point when trade offs need to be made. We made the trade off for a
loss in
>>> exact cost modeling to gain flexibility, coverage, and potential
for
>>> further features. This trade off is the same made for quite a few
IR level
>>> optimizations, including inlining. As for the worry about code size
>>> regressions, so far the results seem to support our hypothesis.
>>>
>>>
>>> Would still be interesting to see how well this could perform on
some
>>> exact model (i.e., at the Machine level), IMO. Target hooks are
cheap and
>>> choosing an implementation because it is simpler might not be the
right
>>> long term solution.
>>> At the very least, to know what trade-off we are making, having
>>> prototypes with the different approaches sounds sensible.
>>> In particular, all the heuristics about cost for parameter passing
>>> (haven’t checked how you did it) sounds already complex enough and
would
>>> require target hooks. Therefore, I am not seeing a clear win with
an IR
>>> approach here.
>>>
>>> Finally, having heuristics solely focused on code size do not seem
>>> realistic. Indeed, I am guessing you have some thresholds to avoid
>>> outlining some piece of code too small that would end up adding a
whole lot
>>> of indirections and I don’t like magic numbers in general :).
>>>
>>> To summarize, I wanted to point out that an IR approach is not as a
>>> clear win as you describe and would thus deserve more discussion.
>>>
>>> Cheers,
>>> -Quentin
>>>
>>>  Thanks,
>>> River Riddle
>>>
>>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at
apple.com>
>>>  wrote:
>>>
>>>>
>>>> Hi River,
>>>>
>>>> I’m working on the MachineOutliner pass at the MIR level.
Working at
>>>> the IR level sounds interesting! It also seems like our
algorithms are
>>>> similar. I was thinking of taking the suffix array route with
the
>>>> MachineOutliner in the future.
>>>>
>>>> Anyway, I’d like to ask about this:
>>>>
>>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>> The downside to having this type of transformation be at the IR
level
>>>> is it means there will be less accuracy in the cost model -  we
can
>>>> somewhat accurately model the cost per instruction but we can’t
get
>>>> information on how a window of instructions may lower. This can
cause
>>>> regressions depending on the platform/codebase, therefore to
help alleviate
>>>> this there are several tunable parameters for the cost model.
>>>>
>>>>
>>>> The inliner is threshold-based and it can be rather
unpredictable how
>>>> it will impact the code size of a program. Do you have any idea
as to how
>>>> heuristics/thresholds/parameters could be tuned to prevent
this? In my
>>>> experience, making good code size decisions with these sorts of
passes
>>>> requires a lot of knowledge about what instructions you’re
dealing with
>>>> exactly. I’ve seen the inliner cause some pretty serious code
size
>>>> regressions in projects due to small changes to the cost
model/parameters
>>>> which cause improvements in other projects. I’m a little
worried that an
>>>> IR-level outliner for code size would be doomed to a similar
fate.
>>>>
>>>> Perhaps it would be interesting to run this sort of pass
pre-register
>>>> allocation? This would help pull you away from having to use
heuristics,
>>>> but give you some more opportunities for finding repeated
instruction
>>>> sequences. I’ve thought of doing something like this in the
future with the
>>>> MachineOutliner and seeing how it goes.
>>>>
>>>> - Jessica
>>>>
>>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170725/89f96886/attachment.html>

Maybe Matching Threads

Search for more reasonably related threads

llvm dev - Jul 2017 - [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Maybe Matching Threads