thr3ads.net - llvm dev - [llvm-dev] [RFC] Add IR level interprocedural outliner for code size. [Jul 2017]

If this information is useful, please help other people find it:
Share via:

River Riddle via llvm-dev

2017-Jul-21 17:01 UTC

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi Evgeny,
 I know of the current machine outliner in LLVM. If you look in the "More
detailed performance data"  in the end section it includes performance
comparisons to the machine outliner.
 As for the algorithmic approach they are kind of similar.
Machine Outliner:
  - Builds a suffix tree based on identical equivalence between machine
instrs.
  - Uses target specific cost model for benefit estimation.
  - Uses a greedy algorithm for candidate pruning.
  - Currently supports X86, AArch64 targets.
  - Requires NoRedZone on supported targets.

IR Outliner:
  - Builds a suffix array based upon relaxed equivalence between ir instrs.
  - Allows for inputs/outputs from an outlined function.
  - Complex verification of outlining candidates due to above two
points(relaxed equivalency, inputs/outputs).
  - Tunable generic target independent cost model (estimates constant
folded inputs, output condensing, etc.).
       - Note it uses the target info available (TargetTransformInfo)
  - Uses a greedy interval based algorithm for candidate pruning.
  - Supports profile data.
  - Supports emitting opt remarks.
  - Target independent, thus no required ABI constraints or hooks.
Though some parts of the approach are similar the implementation is very
different. Also to note, the IR outliner can be expanded to also outline
from multiple basic blocks. I've already prototyped the infrastructure for
outlining regions, so its possible.

Thanks,
  River Riddle

On Fri, Jul 21, 2017 at 2:24 AM, Evgeny Astigeevich <
Evgeny.Astigeevich at arm.com> wrote:
> Hi River,
>
>
> Do you know LLVM has the MachineOutliner pass (
> lib/CodeGen/MachineOutliner.cpp)?
>
> It would be interesting to compare your approach with it.
>
>
> Thanks,
>
> Evgeny Astigeevich
> ------------------------------
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
River
> Riddle via llvm-dev <llvm-dev at lists.llvm.org>
> *Sent:* Thursday, July 20, 2017 11:47:57 PM
> *To:* llvm-dev at lists.llvm.org
> *Subject:* [llvm-dev] [RFC] Add IR level interprocedural outliner for
> code size.
>
>
>  I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been
> working on an interprocedural outlining (code folding) pass for code size
> improvement at the IR level. We hit a couple of use cases that the current
> code size solutions didn’t handle well enough. Outlining is one of the
> avenues that seemed potentially beneficial.
>
> -- Algorithmic Approach --
>
> The general implementation can be explained in stages:
>
> * Candidate Selection:
>
> Each instruction in the module is mapped to a congruence class based upon
> relaxed equivalency constraints. For most instructions this is simply: The
> type, opcode, and operand types. The constraints are tightened for
> instructions with special state that require more exact equivalence (e.g.
> ShuffleVector requires a constant mask vector). Candidates are then found
> by constructing a suffix array and lcp(longest common prefix) array.
> Walking the lcp array gives us congruent chains of instructions within the
> module.
>
> * Candidate Analysis:
>
> A verification step splits candidates that have different internal input
> sequences or incompatible parent function attributes between occurrences.
> An example of incompatible internal inputs sequences is:
>
> X = W + 6;   vs    X = W + 6;
>
> Y = X + 4;            Y = W + 4;
>
> The above two occurrences would need special control flow to exist within
> the same outlined function.
>
> During analysis candidates have their inputs and outputs computed along
> with an estimated benefit from extraction. During input calculation a
> constant folding step removes all inputs that are the same amongst all
> occurrences.
>
> * Candidate Pruning:
>
> Overlapping candidates are pruned with a generic greedy algorithm that
> picks occurrences starting from the most beneficial candidates.
>
> * Candidate Outlining:
>
> Non pruned candidates are then outlined. Outputs from a candidate are
> returned via an output parameter, except for one output that is promoted to
> a return value. During outlining the inputs into the candidate are
> condensed by computing the equivalencies between the arguments at each
> occurrence. An example of this is:
>
> outlinedFn(1,6,1);  ->  outlinedFn(1,6);
>
> outlinedFn(2,4,2);  ->  outlinedFn(2,4);
>
> In the above, parameters 1 and 3 were found to be equivalent for all
> occurrences, thus the amount of inputs was reduced to 2.
>
> * Debug Info:
>
> Debug information is preserved for the calls to functions which have been
> outlined but all debug info from the original outlined portions is removed,
> making them harder to debug.
>
> * Profile Info:
>
> If the pass is running at Os the outliner will only consider cold blocks,
> whereas Oz considers all blocks that are not marked as hot.
>
> * Location in Pipeline:
>
> The pass is currently configured to run very late in the optimization
> pipeline. It is intended to run at Oz but will also run at Os if there is
> profile data available. The pass can optionally be run twice, once before
> function simplification and then again at the default location. This run is
> optional because you are gambling the potential benefits of redundancy
> elimination vs the potential benefits from function simplification. This
> can lead to large benefits or regressions depending on the benchmark
> (though the benefits tend to outnumber the regressions). The performance
> section contains data for both on a variety of benchmarks.
>
> -- Why at the IR level --
>
> The decision to place the outliner at the IR level comes from a couple of
> major factors:
>
> - Desire to be target independent
>
> - Most opportunities for congruency
>
> The downside to having this type of transformation be at the IR level is
> it means there will be less accuracy in the cost model -  we can somewhat
> accurately model the cost per instruction but we can’t get information on
> how a window of instructions may lower. This can cause regressions
> depending on the platform/codebase, therefore to help alleviate this there
> are several tunable parameters for the cost model.
>
> -- Performance --
>
> More results including clang, llvm-tblgen, and more specific numbers about
> benefits/regressions can be found in the notes section below.
>
> * Size Reduction:
>
> - Test Suite(X86_64):
>
>    - Early+Late outlining provides a geomean of 10.5% reduction over clang
> Oz, with a largest improvement of ~67% and largest regression of ~7.5%.
>
>    - Late outlining provides a geomean of 4.65% reduction, with a largest
> improvement of ~51% and largest regression of ~6.4%.
>
> - Spec 2006(X86_64)
>
>    - Early+Late outlining provides a geomean reduction of 2.08%.
>
>    - Late outlining provides 2.09%.
>
> - CSiBE(AArch64)
>
>    - Early+Late outlining shows geomean reduction of around 3.5%.
>
>    - Late outlining shows 3.1%.
>
> * Compile Time:
>
> Compile time was tested under test-suite with a multisample of 5.
>
> - Early+Late outlining
>
>    - Many improvements with > 40% compile time reduction.
>
>    - Few regressions.
>
> - Late outlining
>
>    - Greatest improvement is ~7.8%
>
>    - Greatest regression is ~4% with a difference of <0.02s
>
> Our explanation for the compile time reduction during early outlining is
> that due to the amount of redundancy reduction early in the optimization
> pipeline there is a reduction in the amount of instruction processing
> during the rest of the compilation.
>
> * Execution Time:
>
> Ran with a multisample of 5.
>
> - Test Suite:
>
>    - Early+Late outlining has many regressions up to 97%. The greatest
> improvement was around 7.5%.
>
>    - Late outlining also has several regressions up to 44% and a greatest
> improvement of around 5.3%.
>
> - Spec:
>
>    - Early+Late has a geomean regression of 3.5%.
>
>    - Late outlining has a geomean regression of 1.25%.
>
> The execution time results are to be expected given that the outliner,
> without profile data, will extract from whatever region it deems
> profitable. Extracting from the hot path can lead to a noticeable
> performance regression on any platform, which can be somewhat avoided by
> providing profile data during outlining.
>
> -- Tested Improvements --
>
> * MergeFunctions:
>
>    - No noticeable benefit.
>
> * LTO:
>
>    - LTO doesn’t have a code size pipeline, but %reductions over LTO are
> comparable to non LTO.
>
> * Input/Output Partitioning:
>
>    -This identifies inputs/outputs that may be folded by splitting a
> candidate. The benefit is minimal for the computations required.
>
> * Similar Candidate Merging:
>
>    - The benefit to be gained is currently not worth the large complexity
> required to catch the desired cases.
>
> -- Potential Improvements --
>
> * Suffix&LCP array construction: The pass currently uses a very basic
> implementation that could be improved. There are definitely faster
> algorithms and some can construct both simultaneously. We will investigate
> this as a potential benefit for compile time in the future.
>
> * Os performance tuning: Under -Os the pass currently only runs on cold
> blocks. Ideally we could expand this to be a little more lax on less
> frequently executed blocks that aren’t cold.
>
> * Candidate Selection: The algorithm currently focuses on the longest
> common sequences. More checks could be added to see if shortening the
> sequence length produces a larger benefit(e.g less inputs/outputs). This
> would likely lead to an increase in compile time but should remain less
> than the baseline.
>
> * Non Exact Functions: The outliner currently ignores functions that do
> not have an exact definition.
>
> -- --
>
> * CSiBE(Code Size Benchmark):
>
> www.csibe.org
>
> * More detailed performance data:
>
> goo.gl/5k6wsP
>
> * Implementation:
>
> https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/
> CodeSizeOutliner.cpp
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170721/e82cbbc5/attachment-0001.html>

Evgeny Astigeevich via llvm-dev

2017-Jul-24 12:38 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi River,


Thank you for information. Do you have your code in Phabricator?


As I understand your pass always runs after the Inliner. At the moment the
Inliner is too aggressive at Os/Oz. I am trying to tune Inliner heuristics for
Oz. It would be interesting to see if there are any code size improvements when
the Inliner is disabled. This will demonstrate whether the Outliner could
improve code size when duplicated code is not due to the Inliner. Another
problem I have is that C++ programs get code size regressions due my changes in
the inline heuristics. Have  you seen this problem?


Thanks,

Evgeny Astigeevich

________________________________
From: River Riddle <riddleriver at gmail.com>
Sent: Friday, July 21, 2017 6:01:56 PM
To: Evgeny Astigeevich
Cc: llvm-dev at lists.llvm.org; nd; Kristof Beyls; gaborb at inf.u-szeged.hu
Subject: Re: [llvm-dev] [RFC] Add IR level interprocedural outliner for code
size.

Hi Evgeny,
 I know of the current machine outliner in LLVM. If you look in the "More
detailed performance data"  in the end section it includes performance
comparisons to the machine outliner.
 As for the algorithmic approach they are kind of similar.
Machine Outliner:
  - Builds a suffix tree based on identical equivalence between machine instrs.
  - Uses target specific cost model for benefit estimation.
  - Uses a greedy algorithm for candidate pruning.
  - Currently supports X86, AArch64 targets.
  - Requires NoRedZone on supported targets.

IR Outliner:
  - Builds a suffix array based upon relaxed equivalence between ir instrs.
  - Allows for inputs/outputs from an outlined function.
  - Complex verification of outlining candidates due to above two points(relaxed
equivalency, inputs/outputs).
  - Tunable generic target independent cost model (estimates constant folded
inputs, output condensing, etc.).
       - Note it uses the target info available (TargetTransformInfo)
  - Uses a greedy interval based algorithm for candidate pruning.
  - Supports profile data.
  - Supports emitting opt remarks.
  - Target independent, thus no required ABI constraints or hooks.
Though some parts of the approach are similar the implementation is very
different. Also to note, the IR outliner can be expanded to also outline from
multiple basic blocks. I've already prototyped the infrastructure for
outlining regions, so its possible.

Thanks,
  River Riddle

On Fri, Jul 21, 2017 at 2:24 AM, Evgeny Astigeevich <Evgeny.Astigeevich at
arm.com<mailto:Evgeny.Astigeevich at arm.com>> wrote:

Hi River,


Do you know LLVM has the MachineOutliner pass (lib/CodeGen/MachineOutliner.cpp)?

It would be interesting to compare your approach with it.


Thanks,

Evgeny Astigeevich

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces
at lists.llvm.org>> on behalf of River Riddle via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
Sent: Thursday, July 20, 2017 11:47:57 PM
To: llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
Subject: [llvm-dev] [RFC] Add IR level interprocedural outliner for code size.


 I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been
working on an interprocedural outlining (code folding) pass for code size
improvement at the IR level. We hit a couple of use cases that the current code
size solutions didn’t handle well enough. Outlining is one of the avenues that
seemed potentially beneficial.


-- Algorithmic Approach --

The general implementation can be explained in stages:


* Candidate Selection:

Each instruction in the module is mapped to a congruence class based upon
relaxed equivalency constraints. For most instructions this is simply: The type,
opcode, and operand types. The constraints are tightened for instructions with
special state that require more exact equivalence (e.g. ShuffleVector requires a
constant mask vector). Candidates are then found by constructing a suffix array
and lcp(longest common prefix) array. Walking the lcp array gives us congruent
chains of instructions within the module.


* Candidate Analysis:

A verification step splits candidates that have different internal input
sequences or incompatible parent function attributes between occurrences. An
example of incompatible internal inputs sequences is:

X = W + 6;   vs    X = W + 6;

Y = X + 4;            Y = W + 4;

The above two occurrences would need special control flow to exist within the
same outlined function.


During analysis candidates have their inputs and outputs computed along with an
estimated benefit from extraction. During input calculation a constant folding
step removes all inputs that are the same amongst all occurrences.


* Candidate Pruning:

Overlapping candidates are pruned with a generic greedy algorithm that picks
occurrences starting from the most beneficial candidates.


* Candidate Outlining:

Non pruned candidates are then outlined. Outputs from a candidate are returned
via an output parameter, except for one output that is promoted to a return
value. During outlining the inputs into the candidate are condensed by computing
the equivalencies between the arguments at each occurrence. An example of this
is:


outlinedFn(1,6,1);  ->  outlinedFn(1,6);

outlinedFn(2,4,2);  ->  outlinedFn(2,4);


In the above, parameters 1 and 3 were found to be equivalent for all
occurrences, thus the amount of inputs was reduced to 2.


* Debug Info:

Debug information is preserved for the calls to functions which have been
outlined but all debug info from the original outlined portions is removed,
making them harder to debug.


* Profile Info:

If the pass is running at Os the outliner will only consider cold blocks,
whereas Oz considers all blocks that are not marked as hot.


* Location in Pipeline:

The pass is currently configured to run very late in the optimization pipeline.
It is intended to run at Oz but will also run at Os if there is profile data
available. The pass can optionally be run twice, once before function
simplification and then again at the default location. This run is optional
because you are gambling the potential benefits of redundancy elimination vs the
potential benefits from function simplification. This can lead to large benefits
or regressions depending on the benchmark (though the benefits tend to outnumber
the regressions). The performance section contains data for both on a variety of
benchmarks.


-- Why at the IR level --

The decision to place the outliner at the IR level comes from a couple of major
factors:

- Desire to be target independent

- Most opportunities for congruency


The downside to having this type of transformation be at the IR level is it
means there will be less accuracy in the cost model -  we can somewhat
accurately model the cost per instruction but we can’t get information on how a
window of instructions may lower. This can cause regressions depending on the
platform/codebase, therefore to help alleviate this there are several tunable
parameters for the cost model.


-- Performance --

More results including clang, llvm-tblgen, and more specific numbers about
benefits/regressions can be found in the notes section below.

* Size Reduction:

- Test Suite(X86_64):

   - Early+Late outlining provides a geomean of 10.5% reduction over clang Oz,
with a largest improvement of ~67% and largest regression of ~7.5%.

   - Late outlining provides a geomean of 4.65% reduction, with a largest
improvement of ~51% and largest regression of ~6.4%.

- Spec 2006(X86_64)

   - Early+Late outlining provides a geomean reduction of 2.08%.

   - Late outlining provides 2.09%.

- CSiBE(AArch64)

   - Early+Late outlining shows geomean reduction of around 3.5%.

   - Late outlining shows 3.1%.


* Compile Time:

Compile time was tested under test-suite with a multisample of 5.

- Early+Late outlining

   - Many improvements with > 40% compile time reduction.

   - Few regressions.

- Late outlining

   - Greatest improvement is ~7.8%

   - Greatest regression is ~4% with a difference of <0.02s

Our explanation for the compile time reduction during early outlining is that
due to the amount of redundancy reduction early in the optimization pipeline
there is a reduction in the amount of instruction processing during the rest of
the compilation.


* Execution Time:

Ran with a multisample of 5.

- Test Suite:

   - Early+Late outlining has many regressions up to 97%. The greatest
improvement was around 7.5%.

   - Late outlining also has several regressions up to 44% and a greatest
improvement of around 5.3%.

- Spec:

   - Early+Late has a geomean regression of 3.5%.

   - Late outlining has a geomean regression of 1.25%.

The execution time results are to be expected given that the outliner, without
profile data, will extract from whatever region it deems profitable. Extracting
from the hot path can lead to a noticeable performance regression on any
platform, which can be somewhat avoided by providing profile data during
outlining.


-- Tested Improvements --

* MergeFunctions:

   - No noticeable benefit.

* LTO:

   - LTO doesn’t have a code size pipeline, but %reductions over LTO are
comparable to non LTO.

* Input/Output Partitioning:

   -This identifies inputs/outputs that may be folded by splitting a candidate.
The benefit is minimal for the computations required.

* Similar Candidate Merging:

   - The benefit to be gained is currently not worth the large complexity
required to catch the desired cases.


-- Potential Improvements --

* Suffix&LCP array construction: The pass currently uses a very basic
implementation that could be improved. There are definitely faster algorithms
and some can construct both simultaneously. We will investigate this as a
potential benefit for compile time in the future.


* Os performance tuning: Under -Os the pass currently only runs on cold blocks.
Ideally we could expand this to be a little more lax on less frequently executed
blocks that aren’t cold.


* Candidate Selection: The algorithm currently focuses on the longest common
sequences. More checks could be added to see if shortening the sequence length
produces a larger benefit(e.g less inputs/outputs). This would likely lead to an
increase in compile time but should remain less than the baseline.


* Non Exact Functions: The outliner currently ignores functions that do not have
an exact definition.


-- --

* CSiBE(Code Size Benchmark):

www.csibe.org<http://www.csibe.org/>


* More detailed performance data:

goo.gl/5k6wsP<http://goo.gl/5k6wsP>


* Implementation:

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp


-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/feb6370d/attachment-0001.html>

River Riddle via llvm-dev

2017-Jul-24 17:53 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi Evgeny,
 The code hasn't been posted to Phabricator yet, we wanted to get feedback
on the current design and approach first. The code is forked on GitHub
currently and a link was included in the original email. As for the inliner
having an effect on the potential benefits, I just rebuilt llvm-tblgen with
"-fno-inline -fno-inline-functions" to see if there was any change in
results:
 - The inlining enabled Oz binary was ~20% smaller than the no-inline
binary.
 - LO still gave ~2% benefit(down from 2.5)
 - E+LO gave ~2.5% benefit(down from 3.82)
During testing I have noticed that E+LO functions often have other
functions inlined into them, which possibly leads to some of the benefit
that can be observed from it. I haven't really played around with the
inlining heuristics myself, so I likely won't be able to help very much
further in that regard.
 Thanks,
River Riddle

On Mon, Jul 24, 2017 at 5:38 AM, Evgeny Astigeevich <
Evgeny.Astigeevich at arm.com> wrote:
> Hi River,
>
>
> Thank you for information. Do you have your code in Phabricator?
>
>
> As I understand your pass always runs after the Inliner. At the moment
> the Inliner is too aggressive at Os/Oz. I am trying to tune Inliner
> heuristics for Oz. It would be interesting to see if there are any code
> size improvements when the Inliner is disabled. This will demonstrate
> whether the Outliner could improve code size when duplicated code is not
> due to the Inliner. Another problem I have is that C++ programs get code
> size regressions due my changes in the inline heuristics. Have  you seen
> this problem?
>
>
> Thanks,
>
> Evgeny Astigeevich
> ------------------------------
> *From:* River Riddle <riddleriver at gmail.com>
> *Sent:* Friday, July 21, 2017 6:01:56 PM
> *To:* Evgeny Astigeevich
> *Cc:* llvm-dev at lists.llvm.org; nd; Kristof Beyls; gaborb at
inf.u-szeged.hu
> *Subject:* Re: [llvm-dev] [RFC] Add IR level interprocedural outliner for
> code size.
>
> Hi Evgeny,
>  I know of the current machine outliner in LLVM. If you look in the
"More
> detailed performance data"  in the end section it includes performance
> comparisons to the machine outliner.
>  As for the algorithmic approach they are kind of similar.
> Machine Outliner:
>   - Builds a suffix tree based on identical equivalence between machine
> instrs.
>   - Uses target specific cost model for benefit estimation.
>   - Uses a greedy algorithm for candidate pruning.
>   - Currently supports X86, AArch64 targets.
>   - Requires NoRedZone on supported targets.
>
> IR Outliner:
>   - Builds a suffix array based upon relaxed equivalence between ir instrs.
>   - Allows for inputs/outputs from an outlined function.
>   - Complex verification of outlining candidates due to above two
> points(relaxed equivalency, inputs/outputs).
>   - Tunable generic target independent cost model (estimates constant
> folded inputs, output condensing, etc.).
>        - Note it uses the target info available (TargetTransformInfo)
>   - Uses a greedy interval based algorithm for candidate pruning.
>   - Supports profile data.
>   - Supports emitting opt remarks.
>   - Target independent, thus no required ABI constraints or hooks.
> Though some parts of the approach are similar the implementation is very
> different. Also to note, the IR outliner can be expanded to also outline
> from multiple basic blocks. I've already prototyped the infrastructure
for
> outlining regions, so its possible.
>
> Thanks,
>   River Riddle
>
> On Fri, Jul 21, 2017 at 2:24 AM, Evgeny Astigeevich <
> Evgeny.Astigeevich at arm.com> wrote:
>
>> Hi River,
>>
>>
>> Do you know LLVM has the MachineOutliner pass (
>> lib/CodeGen/MachineOutliner.cpp)?
>>
>> It would be interesting to compare your approach with it.
>>
>>
>> Thanks,
>>
>> Evgeny Astigeevich
>> ------------------------------
>> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf
of River
>> Riddle via llvm-dev <llvm-dev at lists.llvm.org>
>> *Sent:* Thursday, July 20, 2017 11:47:57 PM
>> *To:* llvm-dev at lists.llvm.org
>> *Subject:* [llvm-dev] [RFC] Add IR level interprocedural outliner for
>> code size.
>>
>>
>>  I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve
>> been working on an interprocedural outlining (code folding) pass for
code
>> size improvement at the IR level. We hit a couple of use cases that the
>> current code size solutions didn’t handle well enough. Outlining is one
of
>> the avenues that seemed potentially beneficial.
>>
>> -- Algorithmic Approach --
>>
>> The general implementation can be explained in stages:
>>
>> * Candidate Selection:
>>
>> Each instruction in the module is mapped to a congruence class based
upon
>> relaxed equivalency constraints. For most instructions this is simply:
The
>> type, opcode, and operand types. The constraints are tightened for
>> instructions with special state that require more exact equivalence
(e.g.
>> ShuffleVector requires a constant mask vector). Candidates are then
found
>> by constructing a suffix array and lcp(longest common prefix) array.
>> Walking the lcp array gives us congruent chains of instructions within
the
>> module.
>>
>> * Candidate Analysis:
>>
>> A verification step splits candidates that have different internal
input
>> sequences or incompatible parent function attributes between
occurrences.
>> An example of incompatible internal inputs sequences is:
>>
>> X = W + 6;   vs    X = W + 6;
>>
>> Y = X + 4;            Y = W + 4;
>>
>> The above two occurrences would need special control flow to exist
within
>> the same outlined function.
>>
>> During analysis candidates have their inputs and outputs computed along
>> with an estimated benefit from extraction. During input calculation a
>> constant folding step removes all inputs that are the same amongst all
>> occurrences.
>>
>> * Candidate Pruning:
>>
>> Overlapping candidates are pruned with a generic greedy algorithm that
>> picks occurrences starting from the most beneficial candidates.
>>
>> * Candidate Outlining:
>>
>> Non pruned candidates are then outlined. Outputs from a candidate are
>> returned via an output parameter, except for one output that is
promoted to
>> a return value. During outlining the inputs into the candidate are
>> condensed by computing the equivalencies between the arguments at each
>> occurrence. An example of this is:
>>
>> outlinedFn(1,6,1);  ->  outlinedFn(1,6);
>>
>> outlinedFn(2,4,2);  ->  outlinedFn(2,4);
>>
>> In the above, parameters 1 and 3 were found to be equivalent for all
>> occurrences, thus the amount of inputs was reduced to 2.
>>
>> * Debug Info:
>>
>> Debug information is preserved for the calls to functions which have
been
>> outlined but all debug info from the original outlined portions is
removed,
>> making them harder to debug.
>>
>> * Profile Info:
>>
>> If the pass is running at Os the outliner will only consider cold
blocks,
>> whereas Oz considers all blocks that are not marked as hot.
>>
>> * Location in Pipeline:
>>
>> The pass is currently configured to run very late in the optimization
>> pipeline. It is intended to run at Oz but will also run at Os if there
is
>> profile data available. The pass can optionally be run twice, once
before
>> function simplification and then again at the default location. This
run is
>> optional because you are gambling the potential benefits of redundancy
>> elimination vs the potential benefits from function simplification.
This
>> can lead to large benefits or regressions depending on the benchmark
>> (though the benefits tend to outnumber the regressions). The
performance
>> section contains data for both on a variety of benchmarks.
>>
>> -- Why at the IR level --
>>
>> The decision to place the outliner at the IR level comes from a couple
of
>> major factors:
>>
>> - Desire to be target independent
>>
>> - Most opportunities for congruency
>>
>> The downside to having this type of transformation be at the IR level
is
>> it means there will be less accuracy in the cost model -  we can
somewhat
>> accurately model the cost per instruction but we can’t get information
on
>> how a window of instructions may lower. This can cause regressions
>> depending on the platform/codebase, therefore to help alleviate this
there
>> are several tunable parameters for the cost model.
>>
>> -- Performance --
>>
>> More results including clang, llvm-tblgen, and more specific numbers
>> about benefits/regressions can be found in the notes section below.
>>
>> * Size Reduction:
>>
>> - Test Suite(X86_64):
>>
>>    - Early+Late outlining provides a geomean of 10.5% reduction over
>> clang Oz, with a largest improvement of ~67% and largest regression of
>> ~7.5%.
>>
>>    - Late outlining provides a geomean of 4.65% reduction, with a
largest
>> improvement of ~51% and largest regression of ~6.4%.
>>
>> - Spec 2006(X86_64)
>>
>>    - Early+Late outlining provides a geomean reduction of 2.08%.
>>
>>    - Late outlining provides 2.09%.
>>
>> - CSiBE(AArch64)
>>
>>    - Early+Late outlining shows geomean reduction of around 3.5%.
>>
>>    - Late outlining shows 3.1%.
>>
>> * Compile Time:
>>
>> Compile time was tested under test-suite with a multisample of 5.
>>
>> - Early+Late outlining
>>
>>    - Many improvements with > 40% compile time reduction.
>>
>>    - Few regressions.
>>
>> - Late outlining
>>
>>    - Greatest improvement is ~7.8%
>>
>>    - Greatest regression is ~4% with a difference of <0.02s
>>
>> Our explanation for the compile time reduction during early outlining
is
>> that due to the amount of redundancy reduction early in the
optimization
>> pipeline there is a reduction in the amount of instruction processing
>> during the rest of the compilation.
>>
>> * Execution Time:
>>
>> Ran with a multisample of 5.
>>
>> - Test Suite:
>>
>>    - Early+Late outlining has many regressions up to 97%. The greatest
>> improvement was around 7.5%.
>>
>>    - Late outlining also has several regressions up to 44% and a
greatest
>> improvement of around 5.3%.
>>
>> - Spec:
>>
>>    - Early+Late has a geomean regression of 3.5%.
>>
>>    - Late outlining has a geomean regression of 1.25%.
>>
>> The execution time results are to be expected given that the outliner,
>> without profile data, will extract from whatever region it deems
>> profitable. Extracting from the hot path can lead to a noticeable
>> performance regression on any platform, which can be somewhat avoided
by
>> providing profile data during outlining.
>>
>> -- Tested Improvements --
>>
>> * MergeFunctions:
>>
>>    - No noticeable benefit.
>>
>> * LTO:
>>
>>    - LTO doesn’t have a code size pipeline, but %reductions over LTO
are
>> comparable to non LTO.
>>
>> * Input/Output Partitioning:
>>
>>    -This identifies inputs/outputs that may be folded by splitting a
>> candidate. The benefit is minimal for the computations required.
>>
>> * Similar Candidate Merging:
>>
>>    - The benefit to be gained is currently not worth the large
complexity
>> required to catch the desired cases.
>>
>> -- Potential Improvements --
>>
>> * Suffix&LCP array construction: The pass currently uses a very
basic
>> implementation that could be improved. There are definitely faster
>> algorithms and some can construct both simultaneously. We will
investigate
>> this as a potential benefit for compile time in the future.
>>
>> * Os performance tuning: Under -Os the pass currently only runs on cold
>> blocks. Ideally we could expand this to be a little more lax on less
>> frequently executed blocks that aren’t cold.
>>
>> * Candidate Selection: The algorithm currently focuses on the longest
>> common sequences. More checks could be added to see if shortening the
>> sequence length produces a larger benefit(e.g less inputs/outputs).
This
>> would likely lead to an increase in compile time but should remain less
>> than the baseline.
>>
>> * Non Exact Functions: The outliner currently ignores functions that do
>> not have an exact definition.
>>
>> -- --
>>
>> * CSiBE(Code Size Benchmark):
>>
>> www.csibe.org
>>
>> * More detailed performance data:
>>
>> goo.gl/5k6wsP
>>
>> * Implementation:
>>
>> https://github.com/River707/llvm/blob/outliner/lib/Transform
>> s/IPO/CodeSizeOutliner.cpp
>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/3df8c9d4/attachment-0001.html>

llvm dev - Jul 2017 - [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.