thr3ads.net - llvm dev - [llvm-dev] [RFC] Add IR level interprocedural outliner for code size. [Jul 2017]

If this information is useful, please help other people find it:
Share via:

River Riddle via llvm-dev

2017-Jul-20 22:47 UTC

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been
working on an interprocedural outlining (code folding) pass for code size
improvement at the IR level. We hit a couple of use cases that the current
code size solutions didn’t handle well enough. Outlining is one of the
avenues that seemed potentially beneficial.

-- Algorithmic Approach --

The general implementation can be explained in stages:

* Candidate Selection:

Each instruction in the module is mapped to a congruence class based upon
relaxed equivalency constraints. For most instructions this is simply: The
type, opcode, and operand types. The constraints are tightened for
instructions with special state that require more exact equivalence (e.g.
ShuffleVector requires a constant mask vector). Candidates are then found
by constructing a suffix array and lcp(longest common prefix) array.
Walking the lcp array gives us congruent chains of instructions within the
module.

* Candidate Analysis:

A verification step splits candidates that have different internal input
sequences or incompatible parent function attributes between occurrences.
An example of incompatible internal inputs sequences is:

X = W + 6;   vs    X = W + 6;

Y = X + 4;            Y = W + 4;

The above two occurrences would need special control flow to exist within
the same outlined function.

During analysis candidates have their inputs and outputs computed along
with an estimated benefit from extraction. During input calculation a
constant folding step removes all inputs that are the same amongst all
occurrences.

* Candidate Pruning:

Overlapping candidates are pruned with a generic greedy algorithm that
picks occurrences starting from the most beneficial candidates.

* Candidate Outlining:

Non pruned candidates are then outlined. Outputs from a candidate are
returned via an output parameter, except for one output that is promoted to
a return value. During outlining the inputs into the candidate are
condensed by computing the equivalencies between the arguments at each
occurrence. An example of this is:

outlinedFn(1,6,1);  ->  outlinedFn(1,6);

outlinedFn(2,4,2);  ->  outlinedFn(2,4);

In the above, parameters 1 and 3 were found to be equivalent for all
occurrences, thus the amount of inputs was reduced to 2.

* Debug Info:

Debug information is preserved for the calls to functions which have been
outlined but all debug info from the original outlined portions is removed,
making them harder to debug.

* Profile Info:

If the pass is running at Os the outliner will only consider cold blocks,
whereas Oz considers all blocks that are not marked as hot.

* Location in Pipeline:

The pass is currently configured to run very late in the optimization
pipeline. It is intended to run at Oz but will also run at Os if there is
profile data available. The pass can optionally be run twice, once before
function simplification and then again at the default location. This run is
optional because you are gambling the potential benefits of redundancy
elimination vs the potential benefits from function simplification. This
can lead to large benefits or regressions depending on the benchmark
(though the benefits tend to outnumber the regressions). The performance
section contains data for both on a variety of benchmarks.

-- Why at the IR level --

The decision to place the outliner at the IR level comes from a couple of
major factors:

- Desire to be target independent

- Most opportunities for congruency

The downside to having this type of transformation be at the IR level is it
means there will be less accuracy in the cost model -  we can somewhat
accurately model the cost per instruction but we can’t get information on
how a window of instructions may lower. This can cause regressions
depending on the platform/codebase, therefore to help alleviate this there
are several tunable parameters for the cost model.

-- Performance --

More results including clang, llvm-tblgen, and more specific numbers about
benefits/regressions can be found in the notes section below.

* Size Reduction:

- Test Suite(X86_64):

   - Early+Late outlining provides a geomean of 10.5% reduction over clang
Oz, with a largest improvement of ~67% and largest regression of ~7.5%.

   - Late outlining provides a geomean of 4.65% reduction, with a largest
improvement of ~51% and largest regression of ~6.4%.

- Spec 2006(X86_64)

   - Early+Late outlining provides a geomean reduction of 2.08%.

   - Late outlining provides 2.09%.

- CSiBE(AArch64)

   - Early+Late outlining shows geomean reduction of around 3.5%.

   - Late outlining shows 3.1%.

* Compile Time:

Compile time was tested under test-suite with a multisample of 5.

- Early+Late outlining

   - Many improvements with > 40% compile time reduction.

   - Few regressions.

- Late outlining

   - Greatest improvement is ~7.8%

   - Greatest regression is ~4% with a difference of <0.02s

Our explanation for the compile time reduction during early outlining is
that due to the amount of redundancy reduction early in the optimization
pipeline there is a reduction in the amount of instruction processing
during the rest of the compilation.

* Execution Time:

Ran with a multisample of 5.

- Test Suite:

   - Early+Late outlining has many regressions up to 97%. The greatest
improvement was around 7.5%.

   - Late outlining also has several regressions up to 44% and a greatest
improvement of around 5.3%.

- Spec:

   - Early+Late has a geomean regression of 3.5%.

   - Late outlining has a geomean regression of 1.25%.

The execution time results are to be expected given that the outliner,
without profile data, will extract from whatever region it deems
profitable. Extracting from the hot path can lead to a noticeable
performance regression on any platform, which can be somewhat avoided by
providing profile data during outlining.

-- Tested Improvements --

* MergeFunctions:

   - No noticeable benefit.

* LTO:

   - LTO doesn’t have a code size pipeline, but %reductions over LTO are
comparable to non LTO.

* Input/Output Partitioning:

   -This identifies inputs/outputs that may be folded by splitting a
candidate. The benefit is minimal for the computations required.

* Similar Candidate Merging:

   - The benefit to be gained is currently not worth the large complexity
required to catch the desired cases.

-- Potential Improvements --

* Suffix&LCP array construction: The pass currently uses a very basic
implementation that could be improved. There are definitely faster
algorithms and some can construct both simultaneously. We will investigate
this as a potential benefit for compile time in the future.

* Os performance tuning: Under -Os the pass currently only runs on cold
blocks. Ideally we could expand this to be a little more lax on less
frequently executed blocks that aren’t cold.

* Candidate Selection: The algorithm currently focuses on the longest
common sequences. More checks could be added to see if shortening the
sequence length produces a larger benefit(e.g less inputs/outputs). This
would likely lead to an increase in compile time but should remain less
than the baseline.

* Non Exact Functions: The outliner currently ignores functions that do not
have an exact definition.

-- --

* CSiBE(Code Size Benchmark):

www.csibe.org

* More detailed performance data:

goo.gl/5k6wsP

* Implementation:

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170720/8733497e/attachment.html>

Evgeny Astigeevich via llvm-dev

2017-Jul-21 09:24 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi River,


Do you know LLVM has the MachineOutliner pass (lib/CodeGen/MachineOutliner.cpp)?

It would be interesting to compare your approach with it.


Thanks,

Evgeny Astigeevich

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of River
Riddle via llvm-dev <llvm-dev at lists.llvm.org>
Sent: Thursday, July 20, 2017 11:47:57 PM
To: llvm-dev at lists.llvm.org
Subject: [llvm-dev] [RFC] Add IR level interprocedural outliner for code size.


 I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been
working on an interprocedural outlining (code folding) pass for code size
improvement at the IR level. We hit a couple of use cases that the current code
size solutions didn’t handle well enough. Outlining is one of the avenues that
seemed potentially beneficial.


-- Algorithmic Approach --

The general implementation can be explained in stages:


* Candidate Selection:

Each instruction in the module is mapped to a congruence class based upon
relaxed equivalency constraints. For most instructions this is simply: The type,
opcode, and operand types. The constraints are tightened for instructions with
special state that require more exact equivalence (e.g. ShuffleVector requires a
constant mask vector). Candidates are then found by constructing a suffix array
and lcp(longest common prefix) array. Walking the lcp array gives us congruent
chains of instructions within the module.


* Candidate Analysis:

A verification step splits candidates that have different internal input
sequences or incompatible parent function attributes between occurrences. An
example of incompatible internal inputs sequences is:

X = W + 6;   vs    X = W + 6;

Y = X + 4;            Y = W + 4;

The above two occurrences would need special control flow to exist within the
same outlined function.


During analysis candidates have their inputs and outputs computed along with an
estimated benefit from extraction. During input calculation a constant folding
step removes all inputs that are the same amongst all occurrences.


* Candidate Pruning:

Overlapping candidates are pruned with a generic greedy algorithm that picks
occurrences starting from the most beneficial candidates.


* Candidate Outlining:

Non pruned candidates are then outlined. Outputs from a candidate are returned
via an output parameter, except for one output that is promoted to a return
value. During outlining the inputs into the candidate are condensed by computing
the equivalencies between the arguments at each occurrence. An example of this
is:


outlinedFn(1,6,1);  ->  outlinedFn(1,6);

outlinedFn(2,4,2);  ->  outlinedFn(2,4);


In the above, parameters 1 and 3 were found to be equivalent for all
occurrences, thus the amount of inputs was reduced to 2.


* Debug Info:

Debug information is preserved for the calls to functions which have been
outlined but all debug info from the original outlined portions is removed,
making them harder to debug.


* Profile Info:

If the pass is running at Os the outliner will only consider cold blocks,
whereas Oz considers all blocks that are not marked as hot.


* Location in Pipeline:

The pass is currently configured to run very late in the optimization pipeline.
It is intended to run at Oz but will also run at Os if there is profile data
available. The pass can optionally be run twice, once before function
simplification and then again at the default location. This run is optional
because you are gambling the potential benefits of redundancy elimination vs the
potential benefits from function simplification. This can lead to large benefits
or regressions depending on the benchmark (though the benefits tend to outnumber
the regressions). The performance section contains data for both on a variety of
benchmarks.


-- Why at the IR level --

The decision to place the outliner at the IR level comes from a couple of major
factors:

- Desire to be target independent

- Most opportunities for congruency


The downside to having this type of transformation be at the IR level is it
means there will be less accuracy in the cost model -  we can somewhat
accurately model the cost per instruction but we can’t get information on how a
window of instructions may lower. This can cause regressions depending on the
platform/codebase, therefore to help alleviate this there are several tunable
parameters for the cost model.


-- Performance --

More results including clang, llvm-tblgen, and more specific numbers about
benefits/regressions can be found in the notes section below.

* Size Reduction:

- Test Suite(X86_64):

   - Early+Late outlining provides a geomean of 10.5% reduction over clang Oz,
with a largest improvement of ~67% and largest regression of ~7.5%.

   - Late outlining provides a geomean of 4.65% reduction, with a largest
improvement of ~51% and largest regression of ~6.4%.

- Spec 2006(X86_64)

   - Early+Late outlining provides a geomean reduction of 2.08%.

   - Late outlining provides 2.09%.

- CSiBE(AArch64)

   - Early+Late outlining shows geomean reduction of around 3.5%.

   - Late outlining shows 3.1%.


* Compile Time:

Compile time was tested under test-suite with a multisample of 5.

- Early+Late outlining

   - Many improvements with > 40% compile time reduction.

   - Few regressions.

- Late outlining

   - Greatest improvement is ~7.8%

   - Greatest regression is ~4% with a difference of <0.02s

Our explanation for the compile time reduction during early outlining is that
due to the amount of redundancy reduction early in the optimization pipeline
there is a reduction in the amount of instruction processing during the rest of
the compilation.


* Execution Time:

Ran with a multisample of 5.

- Test Suite:

   - Early+Late outlining has many regressions up to 97%. The greatest
improvement was around 7.5%.

   - Late outlining also has several regressions up to 44% and a greatest
improvement of around 5.3%.

- Spec:

   - Early+Late has a geomean regression of 3.5%.

   - Late outlining has a geomean regression of 1.25%.

The execution time results are to be expected given that the outliner, without
profile data, will extract from whatever region it deems profitable. Extracting
from the hot path can lead to a noticeable performance regression on any
platform, which can be somewhat avoided by providing profile data during
outlining.


-- Tested Improvements --

* MergeFunctions:

   - No noticeable benefit.

* LTO:

   - LTO doesn’t have a code size pipeline, but %reductions over LTO are
comparable to non LTO.

* Input/Output Partitioning:

   -This identifies inputs/outputs that may be folded by splitting a candidate.
The benefit is minimal for the computations required.

* Similar Candidate Merging:

   - The benefit to be gained is currently not worth the large complexity
required to catch the desired cases.


-- Potential Improvements --

* Suffix&LCP array construction: The pass currently uses a very basic
implementation that could be improved. There are definitely faster algorithms
and some can construct both simultaneously. We will investigate this as a
potential benefit for compile time in the future.


* Os performance tuning: Under -Os the pass currently only runs on cold blocks.
Ideally we could expand this to be a little more lax on less frequently executed
blocks that aren’t cold.


* Candidate Selection: The algorithm currently focuses on the longest common
sequences. More checks could be added to see if shortening the sequence length
produces a larger benefit(e.g less inputs/outputs). This would likely lead to an
increase in compile time but should remain less than the baseline.


* Non Exact Functions: The outliner currently ignores functions that do not have
an exact definition.


-- --

* CSiBE(Code Size Benchmark):

www.csibe.org<http://www.csibe.org/>


* More detailed performance data:

goo.gl/5k6wsP<http://goo.gl/5k6wsP>


* Implementation:

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170721/704ff65f/attachment.html>

River Riddle via llvm-dev

2017-Jul-21 17:01 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi Evgeny,
 I know of the current machine outliner in LLVM. If you look in the "More
detailed performance data"  in the end section it includes performance
comparisons to the machine outliner.
 As for the algorithmic approach they are kind of similar.
Machine Outliner:
  - Builds a suffix tree based on identical equivalence between machine
instrs.
  - Uses target specific cost model for benefit estimation.
  - Uses a greedy algorithm for candidate pruning.
  - Currently supports X86, AArch64 targets.
  - Requires NoRedZone on supported targets.

IR Outliner:
  - Builds a suffix array based upon relaxed equivalence between ir instrs.
  - Allows for inputs/outputs from an outlined function.
  - Complex verification of outlining candidates due to above two
points(relaxed equivalency, inputs/outputs).
  - Tunable generic target independent cost model (estimates constant
folded inputs, output condensing, etc.).
       - Note it uses the target info available (TargetTransformInfo)
  - Uses a greedy interval based algorithm for candidate pruning.
  - Supports profile data.
  - Supports emitting opt remarks.
  - Target independent, thus no required ABI constraints or hooks.
Though some parts of the approach are similar the implementation is very
different. Also to note, the IR outliner can be expanded to also outline
from multiple basic blocks. I've already prototyped the infrastructure for
outlining regions, so its possible.

Thanks,
  River Riddle

On Fri, Jul 21, 2017 at 2:24 AM, Evgeny Astigeevich <
Evgeny.Astigeevich at arm.com> wrote:
> Hi River,
>
>
> Do you know LLVM has the MachineOutliner pass (
> lib/CodeGen/MachineOutliner.cpp)?
>
> It would be interesting to compare your approach with it.
>
>
> Thanks,
>
> Evgeny Astigeevich
> ------------------------------
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
River
> Riddle via llvm-dev <llvm-dev at lists.llvm.org>
> *Sent:* Thursday, July 20, 2017 11:47:57 PM
> *To:* llvm-dev at lists.llvm.org
> *Subject:* [llvm-dev] [RFC] Add IR level interprocedural outliner for
> code size.
>
>
>  I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been
> working on an interprocedural outlining (code folding) pass for code size
> improvement at the IR level. We hit a couple of use cases that the current
> code size solutions didn’t handle well enough. Outlining is one of the
> avenues that seemed potentially beneficial.
>
> -- Algorithmic Approach --
>
> The general implementation can be explained in stages:
>
> * Candidate Selection:
>
> Each instruction in the module is mapped to a congruence class based upon
> relaxed equivalency constraints. For most instructions this is simply: The
> type, opcode, and operand types. The constraints are tightened for
> instructions with special state that require more exact equivalence (e.g.
> ShuffleVector requires a constant mask vector). Candidates are then found
> by constructing a suffix array and lcp(longest common prefix) array.
> Walking the lcp array gives us congruent chains of instructions within the
> module.
>
> * Candidate Analysis:
>
> A verification step splits candidates that have different internal input
> sequences or incompatible parent function attributes between occurrences.
> An example of incompatible internal inputs sequences is:
>
> X = W + 6;   vs    X = W + 6;
>
> Y = X + 4;            Y = W + 4;
>
> The above two occurrences would need special control flow to exist within
> the same outlined function.
>
> During analysis candidates have their inputs and outputs computed along
> with an estimated benefit from extraction. During input calculation a
> constant folding step removes all inputs that are the same amongst all
> occurrences.
>
> * Candidate Pruning:
>
> Overlapping candidates are pruned with a generic greedy algorithm that
> picks occurrences starting from the most beneficial candidates.
>
> * Candidate Outlining:
>
> Non pruned candidates are then outlined. Outputs from a candidate are
> returned via an output parameter, except for one output that is promoted to
> a return value. During outlining the inputs into the candidate are
> condensed by computing the equivalencies between the arguments at each
> occurrence. An example of this is:
>
> outlinedFn(1,6,1);  ->  outlinedFn(1,6);
>
> outlinedFn(2,4,2);  ->  outlinedFn(2,4);
>
> In the above, parameters 1 and 3 were found to be equivalent for all
> occurrences, thus the amount of inputs was reduced to 2.
>
> * Debug Info:
>
> Debug information is preserved for the calls to functions which have been
> outlined but all debug info from the original outlined portions is removed,
> making them harder to debug.
>
> * Profile Info:
>
> If the pass is running at Os the outliner will only consider cold blocks,
> whereas Oz considers all blocks that are not marked as hot.
>
> * Location in Pipeline:
>
> The pass is currently configured to run very late in the optimization
> pipeline. It is intended to run at Oz but will also run at Os if there is
> profile data available. The pass can optionally be run twice, once before
> function simplification and then again at the default location. This run is
> optional because you are gambling the potential benefits of redundancy
> elimination vs the potential benefits from function simplification. This
> can lead to large benefits or regressions depending on the benchmark
> (though the benefits tend to outnumber the regressions). The performance
> section contains data for both on a variety of benchmarks.
>
> -- Why at the IR level --
>
> The decision to place the outliner at the IR level comes from a couple of
> major factors:
>
> - Desire to be target independent
>
> - Most opportunities for congruency
>
> The downside to having this type of transformation be at the IR level is
> it means there will be less accuracy in the cost model -  we can somewhat
> accurately model the cost per instruction but we can’t get information on
> how a window of instructions may lower. This can cause regressions
> depending on the platform/codebase, therefore to help alleviate this there
> are several tunable parameters for the cost model.
>
> -- Performance --
>
> More results including clang, llvm-tblgen, and more specific numbers about
> benefits/regressions can be found in the notes section below.
>
> * Size Reduction:
>
> - Test Suite(X86_64):
>
>    - Early+Late outlining provides a geomean of 10.5% reduction over clang
> Oz, with a largest improvement of ~67% and largest regression of ~7.5%.
>
>    - Late outlining provides a geomean of 4.65% reduction, with a largest
> improvement of ~51% and largest regression of ~6.4%.
>
> - Spec 2006(X86_64)
>
>    - Early+Late outlining provides a geomean reduction of 2.08%.
>
>    - Late outlining provides 2.09%.
>
> - CSiBE(AArch64)
>
>    - Early+Late outlining shows geomean reduction of around 3.5%.
>
>    - Late outlining shows 3.1%.
>
> * Compile Time:
>
> Compile time was tested under test-suite with a multisample of 5.
>
> - Early+Late outlining
>
>    - Many improvements with > 40% compile time reduction.
>
>    - Few regressions.
>
> - Late outlining
>
>    - Greatest improvement is ~7.8%
>
>    - Greatest regression is ~4% with a difference of <0.02s
>
> Our explanation for the compile time reduction during early outlining is
> that due to the amount of redundancy reduction early in the optimization
> pipeline there is a reduction in the amount of instruction processing
> during the rest of the compilation.
>
> * Execution Time:
>
> Ran with a multisample of 5.
>
> - Test Suite:
>
>    - Early+Late outlining has many regressions up to 97%. The greatest
> improvement was around 7.5%.
>
>    - Late outlining also has several regressions up to 44% and a greatest
> improvement of around 5.3%.
>
> - Spec:
>
>    - Early+Late has a geomean regression of 3.5%.
>
>    - Late outlining has a geomean regression of 1.25%.
>
> The execution time results are to be expected given that the outliner,
> without profile data, will extract from whatever region it deems
> profitable. Extracting from the hot path can lead to a noticeable
> performance regression on any platform, which can be somewhat avoided by
> providing profile data during outlining.
>
> -- Tested Improvements --
>
> * MergeFunctions:
>
>    - No noticeable benefit.
>
> * LTO:
>
>    - LTO doesn’t have a code size pipeline, but %reductions over LTO are
> comparable to non LTO.
>
> * Input/Output Partitioning:
>
>    -This identifies inputs/outputs that may be folded by splitting a
> candidate. The benefit is minimal for the computations required.
>
> * Similar Candidate Merging:
>
>    - The benefit to be gained is currently not worth the large complexity
> required to catch the desired cases.
>
> -- Potential Improvements --
>
> * Suffix&LCP array construction: The pass currently uses a very basic
> implementation that could be improved. There are definitely faster
> algorithms and some can construct both simultaneously. We will investigate
> this as a potential benefit for compile time in the future.
>
> * Os performance tuning: Under -Os the pass currently only runs on cold
> blocks. Ideally we could expand this to be a little more lax on less
> frequently executed blocks that aren’t cold.
>
> * Candidate Selection: The algorithm currently focuses on the longest
> common sequences. More checks could be added to see if shortening the
> sequence length produces a larger benefit(e.g less inputs/outputs). This
> would likely lead to an increase in compile time but should remain less
> than the baseline.
>
> * Non Exact Functions: The outliner currently ignores functions that do
> not have an exact definition.
>
> -- --
>
> * CSiBE(Code Size Benchmark):
>
> www.csibe.org
>
> * More detailed performance data:
>
> goo.gl/5k6wsP
>
> * Implementation:
>
> https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/
> CodeSizeOutliner.cpp
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170721/e82cbbc5/attachment-0001.html>

Andrey Bokhanko via llvm-dev

2017-Jul-22 22:23 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi River,

Very impressive! -- thanks for working on this.

A few questions, if you don't mind.

First, on results (from goo.gl/5k6wsP). Some of them are quite surprising.
In theory, "top improvements" should be quite similar in all three
approaches ("Early&Late Outlining", "Late Outlining" and
"Machine
Outliner"), with E&LO capturing most of the cases. Yet, they are very
different:

Test Suite, top improvements:
E&LO:

   -

   enc-3des: 66.31%
   -

   StatementReordering-dbl: 51.45%
   -

   Symbolics-dbl: 51.42%
   -

   Recurrences-dbl: 51.38%
   -

   Packing-dbl: 51.33%

LO:

   -

   enc-3des: 50.7%
   -

   ecbdes: 46.27%
   -

   security-rjindael:45.13%
   -

   ControlFlow-flt: 25.79%
   -

   ControlFlow-dbl: 25.74%

MO:

   -

   ecbdes: 28.22%
   -

   Expansion-flt: 22.56%
   -

   Recurrences-flt: 22.19%
   -

   StatementReordering-flt: 22.15%
   -

   Searching-flt: 21.96%


SPEC, top improvements:
E&LO:

   -

   bzip2: 9.15%
   -

   gcc: 4.03%
   -

   sphinx3: 3.8%
   -

   H264ref: 3.24%
   -

   Perlbench: 3%

LO:

   -

   bzip2: 7.27%
   -

   sphinx3: 3.65%
   -

   Namd: 3.08%
   -

   Gcc: 3.06%
   -

   H264ref: 3.05%

MO:

   -

   Namd: 7.8%
   -

   bzip2: 7.27%
   -

   libquantum: 2.99%
   -

   h264ref: 2%


Do you understand why so?

I'm especially interested in cases where MO managed to find redundancies
while E&O+LO didn't. For example, 2.99% on libquantum (or is it simply
below "top 5 results" for E&O+LO?) -- did you investigated this?

Also, it would be nice to specify full options list for SPEC (I assume SPEC
CPU2006?), similar to how results are reported on spec.org.

And a few questions on the RFC:

On Fri, Jul 21, 2017 at 12:47 AM, River Riddle via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> * Debug Info:
>Debug information is preserved for the calls to functions which have
been> outlined but all debug info from the original outlined portions is removed,
> making them harder to debug.
>
Just to check I understand it correctly: you remove *all* debug info in
outlined functions, essentially making them undebuggable -- correct? Did
you considered copying debug info from one of outlined fragments instead?
-- at least line numbers?

The execution time results are to be expected given that the
outliner,> without profile data, will extract from whatever region it deems
> profitable. Extracting from the hot path can lead to a noticeable
> performance regression on any platform, which can be somewhat avoided by
> providing profile data during outlining.
>
Some of regressions are quite severe. It would be interesting to implement
what you stated above and measure -- both code size reductions and
performance degradations -- again.

> * LTO:
>
>    - LTO doesn’t have a code size pipeline, but %reductions over LTO are
> comparable to non LTO.
>
LTO is known to affect code size significantly (for example, by removing
redundant functions), so I'm frankly quite surprised that the results are
the same...

Yours,
Andrey
==Compiler Architect
NXP
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170723/96fe9492/attachment.html>

River Riddle via llvm-dev

2017-Jul-22 23:05 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi Andrey,
 Questions and feedback are very much welcome.
- The explanation as to why the improvements can vary between the IR and
MIR outliner mainly boil down to the level of abstraction that each are
working at. The MIR level has very accurate heuristics and is effectively
the last post ISel target independent code gen pass. The IR outliner on the
other hand has more estimation in the cost model, and can affect the
decisions of function simplification passes, instruction selection, RA,
etc. Taking this into account can lead to different results. Its important
to remember the differences that being at a higher abstraction level can
cause.
- As for the spec(it is 2006, sorry I missed that) command line options, as
well as any other benchmark, I only added "-Oz -mno-red-zone(to keep
comparisons fair) -(whatever enables each transform)" to the default flags
needed for compilation. I'll try to get the exact command line options used
and add them.
- Debug information(line numbers) is currently only kept if it is the same
across all occurrences. This was simply a design choice and can be changed
if keeping one instance is the desired behavior.
- The behavior described with the profile data is already implemented, I
will update the numbers to include the results after including pgo data.
- The LTO results are very experimental given that there isn't a size
pipeline for LTO yet(there should be). The %improvements can be similar to
non LTO but because the LTO binary is generally much smaller the actual
decrease in size is also much smaller. I'll add more detailed LTO numbers
as well.
Thanks,
River Riddle

On Sat, Jul 22, 2017 at 3:23 PM, Andrey Bokhanko <andreybokhanko at
gmail.com>
wrote:
> Hi River,
>
> Very impressive! -- thanks for working on this.
>
> A few questions, if you don't mind.
>
> First, on results (from goo.gl/5k6wsP). Some of them are quite
> surprising. In theory, "top improvements" should be quite similar
in all
> three approaches ("Early&Late Outlining", "Late
Outlining" and "Machine
> Outliner"), with E&LO capturing most of the cases. Yet, they are
very
> different:
>
> Test Suite, top improvements:
> E&LO:
>
>    -
>
>    enc-3des: 66.31%
>    -
>
>    StatementReordering-dbl: 51.45%
>    -
>
>    Symbolics-dbl: 51.42%
>    -
>
>    Recurrences-dbl: 51.38%
>    -
>
>    Packing-dbl: 51.33%
>
> LO:
>
>    -
>
>    enc-3des: 50.7%
>    -
>
>    ecbdes: 46.27%
>    -
>
>    security-rjindael:45.13%
>    -
>
>    ControlFlow-flt: 25.79%
>    -
>
>    ControlFlow-dbl: 25.74%
>
> MO:
>
>    -
>
>    ecbdes: 28.22%
>    -
>
>    Expansion-flt: 22.56%
>    -
>
>    Recurrences-flt: 22.19%
>    -
>
>    StatementReordering-flt: 22.15%
>    -
>
>    Searching-flt: 21.96%
>
>
> SPEC, top improvements:
> E&LO:
>
>    -
>
>    bzip2: 9.15%
>    -
>
>    gcc: 4.03%
>    -
>
>    sphinx3: 3.8%
>    -
>
>    H264ref: 3.24%
>    -
>
>    Perlbench: 3%
>
> LO:
>
>    -
>
>    bzip2: 7.27%
>    -
>
>    sphinx3: 3.65%
>    -
>
>    Namd: 3.08%
>    -
>
>    Gcc: 3.06%
>    -
>
>    H264ref: 3.05%
>
> MO:
>
>    -
>
>    Namd: 7.8%
>    -
>
>    bzip2: 7.27%
>    -
>
>    libquantum: 2.99%
>    -
>
>    h264ref: 2%
>
>
> Do you understand why so?
>
> I'm especially interested in cases where MO managed to find
redundancies
> while E&O+LO didn't. For example, 2.99% on libquantum (or is it
simply
> below "top 5 results" for E&O+LO?) -- did you investigated
this?
>
> Also, it would be nice to specify full options list for SPEC (I assume
> SPEC CPU2006?), similar to how results are reported on spec.org.
>
> And a few questions on the RFC:
>
> On Fri, Jul 21, 2017 at 12:47 AM, River Riddle via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> * Debug Info:
>>
> Debug information is preserved for the calls to functions which have been
>> outlined but all debug info from the original outlined portions is
removed,
>> making them harder to debug.
>>
>
> Just to check I understand it correctly: you remove *all* debug info in
> outlined functions, essentially making them undebuggable -- correct? Did
> you considered copying debug info from one of outlined fragments instead?
> -- at least line numbers?
>
> The execution time results are to be expected given that the outliner,
>> without profile data, will extract from whatever region it deems
>> profitable. Extracting from the hot path can lead to a noticeable
>> performance regression on any platform, which can be somewhat avoided
by
>> providing profile data during outlining.
>>
>
> Some of regressions are quite severe. It would be interesting to implement
> what you stated above and measure -- both code size reductions and
> performance degradations -- again.
>
>
>> * LTO:
>>
>>    - LTO doesn’t have a code size pipeline, but %reductions over LTO
are
>> comparable to non LTO.
>>
>
> LTO is known to affect code size significantly (for example, by removing
> redundant functions), so I'm frankly quite surprised that the results
are
> the same...
>
> Yours,
> Andrey
> ==> Compiler Architect
> NXP
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170722/31e66d68/attachment-0001.html>

Jessica Paquette via llvm-dev

2017-Jul-24 18:12 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi River,

I’m working on the MachineOutliner pass at the MIR level. Working at the IR
level sounds interesting! It also seems like our algorithms are similar. I was
thinking of taking the suffix array route with the MachineOutliner in the
future.

Anyway, I’d like to ask about this:
> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> The downside to having this type of transformation be at the IR level is it
means there will be less accuracy in the cost model -  we can somewhat
accurately model the cost per instruction but we can’t get information on how a
window of instructions may lower. This can cause regressions depending on the
platform/codebase, therefore to help alleviate this there are several tunable
parameters for the cost model.
The inliner is threshold-based and it can be rather unpredictable how it will
impact the code size of a program. Do you have any idea as to how
heuristics/thresholds/parameters could be tuned to prevent this? In my
experience, making good code size decisions with these sorts of passes requires
a lot of knowledge about what instructions you’re dealing with exactly. I’ve
seen the inliner cause some pretty serious code size regressions in projects due
to small changes to the cost model/parameters which cause improvements in other
projects. I’m a little worried that an IR-level outliner for code size would be
doomed to a similar fate.

Perhaps it would be interesting to run this sort of pass pre-register
allocation? This would help pull you away from having to use heuristics, but
give you some more opportunities for finding repeated instruction sequences.
I’ve thought of doing something like this in the future with the MachineOutliner
and seeing how it goes.

- Jessica

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/d8aa94d0/attachment.html>

River Riddle via llvm-dev

2017-Jul-24 18:55 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi Jessica,
 The comparison to the inliner is an interesting one but we think it's
important to note the difference in the use of heuristics. The inliner is
juggling many different tasks at the same time, execution speed, code size,
etc. which can cause the parameters to be very sensitive depending on the
benchmark/platform/etc. The outliners heuristics are focused solely on the
potential code size savings from outlining, and is thus only sensitive to
the current platform. This only creates a problem when we are over
estimating the potential cost of a set of instructions for a particular
target. The cost model parameters are only minimums: instruction sequence
length, estimated benefit, occurrence amount. The heuristics themselves are
conservative and based upon all of the target information available at the
IR level, the parameters are just setting a lower bound to weed out any
outliers. You are correct in that being at the machine level, before or
after RA, will give the most accurate heuristics but we feel there's an
advantage to being at the IR level. At the IR level we can do so many more
things that are either too difficult/complex for the machine level(e.g
parameterization/outputs/etc). Not only can we do these things but they are
available on all targets immediately, without the need for target hooks.
The caution on the use of heuristics is understandable, but there comes a
point when trade offs need to be made. We made the trade off for a loss in
exact cost modeling to gain flexibility, coverage, and potential for
further features. This trade off is the same made for quite a few IR level
optimizations, including inlining. As for the worry about code size
regressions, so far the results seem to support our hypothesis.
 Thanks,
River Riddle

On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at
apple.com>
wrote:
>
> Hi River,
>
> I’m working on the MachineOutliner pass at the MIR level. Working at the
> IR level sounds interesting! It also seems like our algorithms are similar.
> I was thinking of taking the suffix array route with the MachineOutliner in
> the future.
>
> Anyway, I’d like to ask about this:
>
> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> The downside to having this type of transformation be at the IR level is
> it means there will be less accuracy in the cost model -  we can somewhat
> accurately model the cost per instruction but we can’t get information on
> how a window of instructions may lower. This can cause regressions
> depending on the platform/codebase, therefore to help alleviate this there
> are several tunable parameters for the cost model.
>
>
> The inliner is threshold-based and it can be rather unpredictable how it
> will impact the code size of a program. Do you have any idea as to how
> heuristics/thresholds/parameters could be tuned to prevent this? In my
> experience, making good code size decisions with these sorts of passes
> requires a lot of knowledge about what instructions you’re dealing with
> exactly. I’ve seen the inliner cause some pretty serious code size
> regressions in projects due to small changes to the cost model/parameters
> which cause improvements in other projects. I’m a little worried that an
> IR-level outliner for code size would be doomed to a similar fate.
>
> Perhaps it would be interesting to run this sort of pass pre-register
> allocation? This would help pull you away from having to use heuristics,
> but give you some more opportunities for finding repeated instruction
> sequences. I’ve thought of doing something like this in the future with the
> MachineOutliner and seeing how it goes.
>
> - Jessica
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/73a8be56/attachment.html>

Sean Silva via llvm-dev

2017-Jul-25 04:17 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

On Thu, Jul 20, 2017 at 3:47 PM, River Riddle via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>  I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been
> working on an interprocedural outlining (code folding) pass for code size
> improvement at the IR level. We hit a couple of use cases that the current
> code size solutions didn’t handle well enough. Outlining is one of the
> avenues that seemed potentially beneficial.
>
> -- Algorithmic Approach --
>
> The general implementation can be explained in stages:
>
> * Candidate Selection:
>
> Each instruction in the module is mapped to a congruence class based upon
> relaxed equivalency constraints. For most instructions this is simply: The
> type, opcode, and operand types. The constraints are tightened for
> instructions with special state that require more exact equivalence (e.g.
> ShuffleVector requires a constant mask vector). Candidates are then found
> by constructing a suffix array and lcp(longest common prefix) array.
> Walking the lcp array gives us congruent chains of instructions within the
> module.
>
> * Candidate Analysis:
>
> A verification step splits candidates that have different internal input
> sequences or incompatible parent function attributes between occurrences.
> An example of incompatible internal inputs sequences is:
>
> X = W + 6;   vs    X = W + 6;
>
> Y = X + 4;            Y = W + 4;
>
> The above two occurrences would need special control flow to exist within
> the same outlined function.
>
> During analysis candidates have their inputs and outputs computed along
> with an estimated benefit from extraction. During input calculation a
> constant folding step removes all inputs that are the same amongst all
> occurrences.
>
> * Candidate Pruning:
>
> Overlapping candidates are pruned with a generic greedy algorithm that
> picks occurrences starting from the most beneficial candidates.
>
> * Candidate Outlining:
>
> Non pruned candidates are then outlined. Outputs from a candidate are
> returned via an output parameter, except for one output that is promoted to
> a return value. During outlining the inputs into the candidate are
> condensed by computing the equivalencies between the arguments at each
> occurrence. An example of this is:
>
> outlinedFn(1,6,1);  ->  outlinedFn(1,6);
>
> outlinedFn(2,4,2);  ->  outlinedFn(2,4);
>
> In the above, parameters 1 and 3 were found to be equivalent for all
> occurrences, thus the amount of inputs was reduced to 2.
>
> * Debug Info:
>
> Debug information is preserved for the calls to functions which have been
> outlined but all debug info from the original outlined portions is removed,
> making them harder to debug.
>
> * Profile Info:
>
> If the pass is running at Os the outliner will only consider cold blocks,
> whereas Oz considers all blocks that are not marked as hot.
>
> * Location in Pipeline:
>
> The pass is currently configured to run very late in the optimization
> pipeline. It is intended to run at Oz but will also run at Os if there is
> profile data available. The pass can optionally be run twice, once before
> function simplification and then again at the default location. This run is
> optional because you are gambling the potential benefits of redundancy
> elimination vs the potential benefits from function simplification. This
> can lead to large benefits or regressions depending on the benchmark
> (though the benefits tend to outnumber the regressions). The performance
> section contains data for both on a variety of benchmarks.
>
> -- Why at the IR level --
>
> The decision to place the outliner at the IR level comes from a couple of
> major factors:
>
> - Desire to be target independent
>
> - Most opportunities for congruency
>
> The downside to having this type of transformation be at the IR level is
> it means there will be less accuracy in the cost model -  we can somewhat
> accurately model the cost per instruction but we can’t get information on
> how a window of instructions may lower. This can cause regressions
> depending on the platform/codebase, therefore to help alleviate this there
> are several tunable parameters for the cost model.
>
> -- Performance --
>
> More results including clang, llvm-tblgen, and more specific numbers about
> benefits/regressions can be found in the notes section below.
>
> * Size Reduction:
>
> - Test Suite(X86_64):
>
>    - Early+Late outlining provides a geomean of 10.5% reduction over clang
> Oz, with a largest improvement of ~67% and largest regression of ~7.5%.
>
>    - Late outlining provides a geomean of 4.65% reduction, with a largest
> improvement of ~51% and largest regression of ~6.4%.
>
> - Spec 2006(X86_64)
>
>    - Early+Late outlining provides a geomean reduction of 2.08%.
>
>    - Late outlining provides 2.09%.
>
> - CSiBE(AArch64)
>
>    - Early+Late outlining shows geomean reduction of around 3.5%.
>
>    - Late outlining shows 3.1%.
>

It would be good to visualize these results. Maybe a bar chart like
https://goo.gl/qN2HqA from
http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html for
SPEC?

>
> * Compile Time:
>
> Compile time was tested under test-suite with a multisample of 5.
>
> - Early+Late outlining
>
>    - Many improvements with > 40% compile time reduction.
>
>    - Few regressions.
>
> - Late outlining
>
>    - Greatest improvement is ~7.8%
>
>    - Greatest regression is ~4% with a difference of <0.02s
>
> Our explanation for the compile time reduction during early outlining is
> that due to the amount of redundancy reduction early in the optimization
> pipeline there is a reduction in the amount of instruction processing
> during the rest of the compilation.
>
> * Execution Time:
>
> Ran with a multisample of 5.
>
> - Test Suite:
>
>    - Early+Late outlining has many regressions up to 97%. The greatest
> improvement was around 7.5%.
>
>    - Late outlining also has several regressions up to 44% and a greatest
> improvement of around 5.3%.
>
> - Spec:
>
>    - Early+Late has a geomean regression of 3.5%.
>
>    - Late outlining has a geomean regression of 1.25%.
>
> The execution time results are to be expected given that the outliner,
> without profile data, will extract from whatever region it deems
> profitable. Extracting from the hot path can lead to a noticeable
> performance regression on any platform, which can be somewhat avoided by
> providing profile data during outlining.
>
> -- Tested Improvements --
>
> * MergeFunctions:
>
>    - No noticeable benefit.
>
> * LTO:
>
>    - LTO doesn’t have a code size pipeline, but %reductions over LTO are
> comparable to non LTO.
>
-Os/-Oz are communicated through the optsize and minsize attributes. There
isn't a specific code size pipeline per se (I think this is true for per-TU
compilation as well, though I would have to check).

Also, can you clarify what you mean by "LTO"? I assume this means that
the
outliner did not run during per-TU compilation but did run on the combined
FullLTO bitcode, but want to check to be sure.

-- Sean Silva

> * Input/Output Partitioning:
>
>    -This identifies inputs/outputs that may be folded by splitting a
> candidate. The benefit is minimal for the computations required.
>
> * Similar Candidate Merging:
>
>    - The benefit to be gained is currently not worth the large complexity
> required to catch the desired cases.
>
> -- Potential Improvements --
>
> * Suffix&LCP array construction: The pass currently uses a very basic
> implementation that could be improved. There are definitely faster
> algorithms and some can construct both simultaneously. We will investigate
> this as a potential benefit for compile time in the future.
>
> * Os performance tuning: Under -Os the pass currently only runs on cold
> blocks. Ideally we could expand this to be a little more lax on less
> frequently executed blocks that aren’t cold.
>
> * Candidate Selection: The algorithm currently focuses on the longest
> common sequences. More checks could be added to see if shortening the
> sequence length produces a larger benefit(e.g less inputs/outputs). This
> would likely lead to an increase in compile time but should remain less
> than the baseline.
>
> * Non Exact Functions: The outliner currently ignores functions that do
> not have an exact definition.
>
> -- --
>
> * CSiBE(Code Size Benchmark):
>
> www.csibe.org
>
> * More detailed performance data:
>
> goo.gl/5k6wsP
>
> * Implementation:
>
> https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/
> CodeSizeOutliner.cpp
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/6112d84e/attachment.html>

River Riddle via llvm-dev

2017-Jul-25 04:25 UTC

head link

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hey Sean,
  The bit about the attributes is good to know. When LTO is enabled the
early run will still run per-TU but the late run will be shifted to work on
the full LTO bitcode. Also I don't have specific numbers on how often
parameterization is utilized but I can assure you that it's a majority of
the time. I'll look into transforming the data into a visual format since
its in google docs anyways.
Thanks,
 River Riddle
On Mon, Jul 24, 2017 at 9:17 PM Sean Silva <chisophugis at gmail.com>
wrote:
> On Thu, Jul 20, 2017 at 3:47 PM, River Riddle via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>  I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve
>> been working on an interprocedural outlining (code folding) pass for
code
>> size improvement at the IR level. We hit a couple of use cases that the
>> current code size solutions didn’t handle well enough. Outlining is one
of
>> the avenues that seemed potentially beneficial.
>>
>> -- Algorithmic Approach --
>>
>> The general implementation can be explained in stages:
>>
>> * Candidate Selection:
>>
>> Each instruction in the module is mapped to a congruence class based
upon
>> relaxed equivalency constraints. For most instructions this is simply:
The
>> type, opcode, and operand types. The constraints are tightened for
>> instructions with special state that require more exact equivalence
(e.g.
>> ShuffleVector requires a constant mask vector). Candidates are then
found
>> by constructing a suffix array and lcp(longest common prefix) array.
>> Walking the lcp array gives us congruent chains of instructions within
the
>> module.
>>
>> * Candidate Analysis:
>>
>> A verification step splits candidates that have different internal
input
>> sequences or incompatible parent function attributes between
occurrences.
>> An example of incompatible internal inputs sequences is:
>>
>> X = W + 6;   vs    X = W + 6;
>>
>> Y = X + 4;            Y = W + 4;
>>
>> The above two occurrences would need special control flow to exist
within
>> the same outlined function.
>>
>> During analysis candidates have their inputs and outputs computed along
>> with an estimated benefit from extraction. During input calculation a
>> constant folding step removes all inputs that are the same amongst all
>> occurrences.
>>
>> * Candidate Pruning:
>>
>> Overlapping candidates are pruned with a generic greedy algorithm that
>> picks occurrences starting from the most beneficial candidates.
>>
>> * Candidate Outlining:
>>
>> Non pruned candidates are then outlined. Outputs from a candidate are
>> returned via an output parameter, except for one output that is
promoted to
>> a return value. During outlining the inputs into the candidate are
>> condensed by computing the equivalencies between the arguments at each
>> occurrence. An example of this is:
>>
>> outlinedFn(1,6,1);  ->  outlinedFn(1,6);
>>
>> outlinedFn(2,4,2);  ->  outlinedFn(2,4);
>>
>> In the above, parameters 1 and 3 were found to be equivalent for all
>> occurrences, thus the amount of inputs was reduced to 2.
>>
>> * Debug Info:
>>
>> Debug information is preserved for the calls to functions which have
been
>> outlined but all debug info from the original outlined portions is
removed,
>> making them harder to debug.
>>
>> * Profile Info:
>>
>> If the pass is running at Os the outliner will only consider cold
blocks,
>> whereas Oz considers all blocks that are not marked as hot.
>>
>> * Location in Pipeline:
>>
>> The pass is currently configured to run very late in the optimization
>> pipeline. It is intended to run at Oz but will also run at Os if there
is
>> profile data available. The pass can optionally be run twice, once
before
>> function simplification and then again at the default location. This
run is
>> optional because you are gambling the potential benefits of redundancy
>> elimination vs the potential benefits from function simplification.
This
>> can lead to large benefits or regressions depending on the benchmark
>> (though the benefits tend to outnumber the regressions). The
performance
>> section contains data for both on a variety of benchmarks.
>>
>> -- Why at the IR level --
>>
>> The decision to place the outliner at the IR level comes from a couple
of
>> major factors:
>>
>> - Desire to be target independent
>>
>> - Most opportunities for congruency
>>
>> The downside to having this type of transformation be at the IR level
is
>> it means there will be less accuracy in the cost model -  we can
somewhat
>> accurately model the cost per instruction but we can’t get information
on
>> how a window of instructions may lower. This can cause regressions
>> depending on the platform/codebase, therefore to help alleviate this
there
>> are several tunable parameters for the cost model.
>>
>> -- Performance --
>>
>> More results including clang, llvm-tblgen, and more specific numbers
>> about benefits/regressions can be found in the notes section below.
>>
>> * Size Reduction:
>>
>> - Test Suite(X86_64):
>>
>>    - Early+Late outlining provides a geomean of 10.5% reduction over
>> clang Oz, with a largest improvement of ~67% and largest regression of
>> ~7.5%.
>>
>>    - Late outlining provides a geomean of 4.65% reduction, with a
largest
>> improvement of ~51% and largest regression of ~6.4%.
>>
>> - Spec 2006(X86_64)
>>
>>    - Early+Late outlining provides a geomean reduction of 2.08%.
>>
>>    - Late outlining provides 2.09%.
>>
>> - CSiBE(AArch64)
>>
>>    - Early+Late outlining shows geomean reduction of around 3.5%.
>>
>>    - Late outlining shows 3.1%.
>>
>
>
> It would be good to visualize these results. Maybe a bar chart like
> https://goo.gl/qN2HqA from http://blog.llvm.org/
> 2016/06/thinlto-scalable-and-incremental-lto.html for SPEC?
>
>
>>
>> * Compile Time:
>>
>> Compile time was tested under test-suite with a multisample of 5.
>>
>> - Early+Late outlining
>>
>>    - Many improvements with > 40% compile time reduction.
>>
>>    - Few regressions.
>>
>> - Late outlining
>>
>>    - Greatest improvement is ~7.8%
>>
>>    - Greatest regression is ~4% with a difference of <0.02s
>>
>> Our explanation for the compile time reduction during early outlining
is
>> that due to the amount of redundancy reduction early in the
optimization
>> pipeline there is a reduction in the amount of instruction processing
>> during the rest of the compilation.
>>
>> * Execution Time:
>>
>> Ran with a multisample of 5.
>>
>> - Test Suite:
>>
>>    - Early+Late outlining has many regressions up to 97%. The greatest
>> improvement was around 7.5%.
>>
>>    - Late outlining also has several regressions up to 44% and a
greatest
>> improvement of around 5.3%.
>>
>> - Spec:
>>
>>    - Early+Late has a geomean regression of 3.5%.
>>
>>    - Late outlining has a geomean regression of 1.25%.
>>
>> The execution time results are to be expected given that the outliner,
>> without profile data, will extract from whatever region it deems
>> profitable. Extracting from the hot path can lead to a noticeable
>> performance regression on any platform, which can be somewhat avoided
by
>> providing profile data during outlining.
>>
>> -- Tested Improvements --
>>
>> * MergeFunctions:
>>
>>    - No noticeable benefit.
>>
>> * LTO:
>>
>>    - LTO doesn’t have a code size pipeline, but %reductions over LTO
are
>> comparable to non LTO.
>>
>
> -Os/-Oz are communicated through the optsize and minsize attributes. There
> isn't a specific code size pipeline per se (I think this is true for
per-TU
> compilation as well, though I would have to check).
>
> Also, can you clarify what you mean by "LTO"? I assume this means
that the
> outliner did not run during per-TU compilation but did run on the combined
> FullLTO bitcode, but want to check to be sure.
>
> -- Sean Silva
>
>
>> * Input/Output Partitioning:
>>
>>    -This identifies inputs/outputs that may be folded by splitting a
>> candidate. The benefit is minimal for the computations required.
>>
>> * Similar Candidate Merging:
>>
>>    - The benefit to be gained is currently not worth the large
complexity
>> required to catch the desired cases.
>>
>> -- Potential Improvements --
>>
>> * Suffix&LCP array construction: The pass currently uses a very
basic
>> implementation that could be improved. There are definitely faster
>> algorithms and some can construct both simultaneously. We will
investigate
>> this as a potential benefit for compile time in the future.
>>
>> * Os performance tuning: Under -Os the pass currently only runs on cold
>> blocks. Ideally we could expand this to be a little more lax on less
>> frequently executed blocks that aren’t cold.
>>
>> * Candidate Selection: The algorithm currently focuses on the longest
>> common sequences. More checks could be added to see if shortening the
>> sequence length produces a larger benefit(e.g less inputs/outputs).
This
>> would likely lead to an increase in compile time but should remain less
>> than the baseline.
>>
>> * Non Exact Functions: The outliner currently ignores functions that do
>> not have an exact definition.
>>
>> -- --
>>
>> * CSiBE(Code Size Benchmark):
>>
>> www.csibe.org
>>
>> * More detailed performance data:
>>
>> goo.gl/5k6wsP
>>
>> * Implementation:
>>
>> https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/
>> CodeSizeOutliner.cpp
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/6c427e9d/attachment.html>

Possibly Parallel Threads

Search for more seemingly similar threads

llvm dev - Jul 2017 - [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Possibly Parallel Threads