River Riddle via llvm-dev
2017-Jul-20  22:47 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been working on an interprocedural outlining (code folding) pass for code size improvement at the IR level. We hit a couple of use cases that the current code size solutions didn’t handle well enough. Outlining is one of the avenues that seemed potentially beneficial. -- Algorithmic Approach -- The general implementation can be explained in stages: * Candidate Selection: Each instruction in the module is mapped to a congruence class based upon relaxed equivalency constraints. For most instructions this is simply: The type, opcode, and operand types. The constraints are tightened for instructions with special state that require more exact equivalence (e.g. ShuffleVector requires a constant mask vector). Candidates are then found by constructing a suffix array and lcp(longest common prefix) array. Walking the lcp array gives us congruent chains of instructions within the module. * Candidate Analysis: A verification step splits candidates that have different internal input sequences or incompatible parent function attributes between occurrences. An example of incompatible internal inputs sequences is: X = W + 6; vs X = W + 6; Y = X + 4; Y = W + 4; The above two occurrences would need special control flow to exist within the same outlined function. During analysis candidates have their inputs and outputs computed along with an estimated benefit from extraction. During input calculation a constant folding step removes all inputs that are the same amongst all occurrences. * Candidate Pruning: Overlapping candidates are pruned with a generic greedy algorithm that picks occurrences starting from the most beneficial candidates. * Candidate Outlining: Non pruned candidates are then outlined. Outputs from a candidate are returned via an output parameter, except for one output that is promoted to a return value. During outlining the inputs into the candidate are condensed by computing the equivalencies between the arguments at each occurrence. An example of this is: outlinedFn(1,6,1); -> outlinedFn(1,6); outlinedFn(2,4,2); -> outlinedFn(2,4); In the above, parameters 1 and 3 were found to be equivalent for all occurrences, thus the amount of inputs was reduced to 2. * Debug Info: Debug information is preserved for the calls to functions which have been outlined but all debug info from the original outlined portions is removed, making them harder to debug. * Profile Info: If the pass is running at Os the outliner will only consider cold blocks, whereas Oz considers all blocks that are not marked as hot. * Location in Pipeline: The pass is currently configured to run very late in the optimization pipeline. It is intended to run at Oz but will also run at Os if there is profile data available. The pass can optionally be run twice, once before function simplification and then again at the default location. This run is optional because you are gambling the potential benefits of redundancy elimination vs the potential benefits from function simplification. This can lead to large benefits or regressions depending on the benchmark (though the benefits tend to outnumber the regressions). The performance section contains data for both on a variety of benchmarks. -- Why at the IR level -- The decision to place the outliner at the IR level comes from a couple of major factors: - Desire to be target independent - Most opportunities for congruency The downside to having this type of transformation be at the IR level is it means there will be less accuracy in the cost model - we can somewhat accurately model the cost per instruction but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, therefore to help alleviate this there are several tunable parameters for the cost model. -- Performance -- More results including clang, llvm-tblgen, and more specific numbers about benefits/regressions can be found in the notes section below. * Size Reduction: - Test Suite(X86_64): - Early+Late outlining provides a geomean of 10.5% reduction over clang Oz, with a largest improvement of ~67% and largest regression of ~7.5%. - Late outlining provides a geomean of 4.65% reduction, with a largest improvement of ~51% and largest regression of ~6.4%. - Spec 2006(X86_64) - Early+Late outlining provides a geomean reduction of 2.08%. - Late outlining provides 2.09%. - CSiBE(AArch64) - Early+Late outlining shows geomean reduction of around 3.5%. - Late outlining shows 3.1%. * Compile Time: Compile time was tested under test-suite with a multisample of 5. - Early+Late outlining - Many improvements with > 40% compile time reduction. - Few regressions. - Late outlining - Greatest improvement is ~7.8% - Greatest regression is ~4% with a difference of <0.02s Our explanation for the compile time reduction during early outlining is that due to the amount of redundancy reduction early in the optimization pipeline there is a reduction in the amount of instruction processing during the rest of the compilation. * Execution Time: Ran with a multisample of 5. - Test Suite: - Early+Late outlining has many regressions up to 97%. The greatest improvement was around 7.5%. - Late outlining also has several regressions up to 44% and a greatest improvement of around 5.3%. - Spec: - Early+Late has a geomean regression of 3.5%. - Late outlining has a geomean regression of 1.25%. The execution time results are to be expected given that the outliner, without profile data, will extract from whatever region it deems profitable. Extracting from the hot path can lead to a noticeable performance regression on any platform, which can be somewhat avoided by providing profile data during outlining. -- Tested Improvements -- * MergeFunctions: - No noticeable benefit. * LTO: - LTO doesn’t have a code size pipeline, but %reductions over LTO are comparable to non LTO. * Input/Output Partitioning: -This identifies inputs/outputs that may be folded by splitting a candidate. The benefit is minimal for the computations required. * Similar Candidate Merging: - The benefit to be gained is currently not worth the large complexity required to catch the desired cases. -- Potential Improvements -- * Suffix&LCP array construction: The pass currently uses a very basic implementation that could be improved. There are definitely faster algorithms and some can construct both simultaneously. We will investigate this as a potential benefit for compile time in the future. * Os performance tuning: Under -Os the pass currently only runs on cold blocks. Ideally we could expand this to be a little more lax on less frequently executed blocks that aren’t cold. * Candidate Selection: The algorithm currently focuses on the longest common sequences. More checks could be added to see if shortening the sequence length produces a larger benefit(e.g less inputs/outputs). This would likely lead to an increase in compile time but should remain less than the baseline. * Non Exact Functions: The outliner currently ignores functions that do not have an exact definition. -- -- * CSiBE(Code Size Benchmark): www.csibe.org * More detailed performance data: goo.gl/5k6wsP * Implementation: https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170720/8733497e/attachment.html>
Evgeny Astigeevich via llvm-dev
2017-Jul-21  09:24 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hi River, Do you know LLVM has the MachineOutliner pass (lib/CodeGen/MachineOutliner.cpp)? It would be interesting to compare your approach with it. Thanks, Evgeny Astigeevich ________________________________ From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of River Riddle via llvm-dev <llvm-dev at lists.llvm.org> Sent: Thursday, July 20, 2017 11:47:57 PM To: llvm-dev at lists.llvm.org Subject: [llvm-dev] [RFC] Add IR level interprocedural outliner for code size. I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been working on an interprocedural outlining (code folding) pass for code size improvement at the IR level. We hit a couple of use cases that the current code size solutions didn’t handle well enough. Outlining is one of the avenues that seemed potentially beneficial. -- Algorithmic Approach -- The general implementation can be explained in stages: * Candidate Selection: Each instruction in the module is mapped to a congruence class based upon relaxed equivalency constraints. For most instructions this is simply: The type, opcode, and operand types. The constraints are tightened for instructions with special state that require more exact equivalence (e.g. ShuffleVector requires a constant mask vector). Candidates are then found by constructing a suffix array and lcp(longest common prefix) array. Walking the lcp array gives us congruent chains of instructions within the module. * Candidate Analysis: A verification step splits candidates that have different internal input sequences or incompatible parent function attributes between occurrences. An example of incompatible internal inputs sequences is: X = W + 6; vs X = W + 6; Y = X + 4; Y = W + 4; The above two occurrences would need special control flow to exist within the same outlined function. During analysis candidates have their inputs and outputs computed along with an estimated benefit from extraction. During input calculation a constant folding step removes all inputs that are the same amongst all occurrences. * Candidate Pruning: Overlapping candidates are pruned with a generic greedy algorithm that picks occurrences starting from the most beneficial candidates. * Candidate Outlining: Non pruned candidates are then outlined. Outputs from a candidate are returned via an output parameter, except for one output that is promoted to a return value. During outlining the inputs into the candidate are condensed by computing the equivalencies between the arguments at each occurrence. An example of this is: outlinedFn(1,6,1); -> outlinedFn(1,6); outlinedFn(2,4,2); -> outlinedFn(2,4); In the above, parameters 1 and 3 were found to be equivalent for all occurrences, thus the amount of inputs was reduced to 2. * Debug Info: Debug information is preserved for the calls to functions which have been outlined but all debug info from the original outlined portions is removed, making them harder to debug. * Profile Info: If the pass is running at Os the outliner will only consider cold blocks, whereas Oz considers all blocks that are not marked as hot. * Location in Pipeline: The pass is currently configured to run very late in the optimization pipeline. It is intended to run at Oz but will also run at Os if there is profile data available. The pass can optionally be run twice, once before function simplification and then again at the default location. This run is optional because you are gambling the potential benefits of redundancy elimination vs the potential benefits from function simplification. This can lead to large benefits or regressions depending on the benchmark (though the benefits tend to outnumber the regressions). The performance section contains data for both on a variety of benchmarks. -- Why at the IR level -- The decision to place the outliner at the IR level comes from a couple of major factors: - Desire to be target independent - Most opportunities for congruency The downside to having this type of transformation be at the IR level is it means there will be less accuracy in the cost model - we can somewhat accurately model the cost per instruction but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, therefore to help alleviate this there are several tunable parameters for the cost model. -- Performance -- More results including clang, llvm-tblgen, and more specific numbers about benefits/regressions can be found in the notes section below. * Size Reduction: - Test Suite(X86_64): - Early+Late outlining provides a geomean of 10.5% reduction over clang Oz, with a largest improvement of ~67% and largest regression of ~7.5%. - Late outlining provides a geomean of 4.65% reduction, with a largest improvement of ~51% and largest regression of ~6.4%. - Spec 2006(X86_64) - Early+Late outlining provides a geomean reduction of 2.08%. - Late outlining provides 2.09%. - CSiBE(AArch64) - Early+Late outlining shows geomean reduction of around 3.5%. - Late outlining shows 3.1%. * Compile Time: Compile time was tested under test-suite with a multisample of 5. - Early+Late outlining - Many improvements with > 40% compile time reduction. - Few regressions. - Late outlining - Greatest improvement is ~7.8% - Greatest regression is ~4% with a difference of <0.02s Our explanation for the compile time reduction during early outlining is that due to the amount of redundancy reduction early in the optimization pipeline there is a reduction in the amount of instruction processing during the rest of the compilation. * Execution Time: Ran with a multisample of 5. - Test Suite: - Early+Late outlining has many regressions up to 97%. The greatest improvement was around 7.5%. - Late outlining also has several regressions up to 44% and a greatest improvement of around 5.3%. - Spec: - Early+Late has a geomean regression of 3.5%. - Late outlining has a geomean regression of 1.25%. The execution time results are to be expected given that the outliner, without profile data, will extract from whatever region it deems profitable. Extracting from the hot path can lead to a noticeable performance regression on any platform, which can be somewhat avoided by providing profile data during outlining. -- Tested Improvements -- * MergeFunctions: - No noticeable benefit. * LTO: - LTO doesn’t have a code size pipeline, but %reductions over LTO are comparable to non LTO. * Input/Output Partitioning: -This identifies inputs/outputs that may be folded by splitting a candidate. The benefit is minimal for the computations required. * Similar Candidate Merging: - The benefit to be gained is currently not worth the large complexity required to catch the desired cases. -- Potential Improvements -- * Suffix&LCP array construction: The pass currently uses a very basic implementation that could be improved. There are definitely faster algorithms and some can construct both simultaneously. We will investigate this as a potential benefit for compile time in the future. * Os performance tuning: Under -Os the pass currently only runs on cold blocks. Ideally we could expand this to be a little more lax on less frequently executed blocks that aren’t cold. * Candidate Selection: The algorithm currently focuses on the longest common sequences. More checks could be added to see if shortening the sequence length produces a larger benefit(e.g less inputs/outputs). This would likely lead to an increase in compile time but should remain less than the baseline. * Non Exact Functions: The outliner currently ignores functions that do not have an exact definition. -- -- * CSiBE(Code Size Benchmark): www.csibe.org<http://www.csibe.org/> * More detailed performance data: goo.gl/5k6wsP<http://goo.gl/5k6wsP> * Implementation: https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170721/704ff65f/attachment.html>
River Riddle via llvm-dev
2017-Jul-21  17:01 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hi Evgeny,
 I know of the current machine outliner in LLVM. If you look in the "More
detailed performance data"  in the end section it includes performance
comparisons to the machine outliner.
 As for the algorithmic approach they are kind of similar.
Machine Outliner:
  - Builds a suffix tree based on identical equivalence between machine
instrs.
  - Uses target specific cost model for benefit estimation.
  - Uses a greedy algorithm for candidate pruning.
  - Currently supports X86, AArch64 targets.
  - Requires NoRedZone on supported targets.
IR Outliner:
  - Builds a suffix array based upon relaxed equivalence between ir instrs.
  - Allows for inputs/outputs from an outlined function.
  - Complex verification of outlining candidates due to above two
points(relaxed equivalency, inputs/outputs).
  - Tunable generic target independent cost model (estimates constant
folded inputs, output condensing, etc.).
       - Note it uses the target info available (TargetTransformInfo)
  - Uses a greedy interval based algorithm for candidate pruning.
  - Supports profile data.
  - Supports emitting opt remarks.
  - Target independent, thus no required ABI constraints or hooks.
Though some parts of the approach are similar the implementation is very
different. Also to note, the IR outliner can be expanded to also outline
from multiple basic blocks. I've already prototyped the infrastructure for
outlining regions, so its possible.
Thanks,
  River Riddle
On Fri, Jul 21, 2017 at 2:24 AM, Evgeny Astigeevich <
Evgeny.Astigeevich at arm.com> wrote:
> Hi River,
>
>
> Do you know LLVM has the MachineOutliner pass (
> lib/CodeGen/MachineOutliner.cpp)?
>
> It would be interesting to compare your approach with it.
>
>
> Thanks,
>
> Evgeny Astigeevich
> ------------------------------
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
River
> Riddle via llvm-dev <llvm-dev at lists.llvm.org>
> *Sent:* Thursday, July 20, 2017 11:47:57 PM
> *To:* llvm-dev at lists.llvm.org
> *Subject:* [llvm-dev] [RFC] Add IR level interprocedural outliner for
> code size.
>
>
>  I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been
> working on an interprocedural outlining (code folding) pass for code size
> improvement at the IR level. We hit a couple of use cases that the current
> code size solutions didn’t handle well enough. Outlining is one of the
> avenues that seemed potentially beneficial.
>
> -- Algorithmic Approach --
>
> The general implementation can be explained in stages:
>
> * Candidate Selection:
>
> Each instruction in the module is mapped to a congruence class based upon
> relaxed equivalency constraints. For most instructions this is simply: The
> type, opcode, and operand types. The constraints are tightened for
> instructions with special state that require more exact equivalence (e.g.
> ShuffleVector requires a constant mask vector). Candidates are then found
> by constructing a suffix array and lcp(longest common prefix) array.
> Walking the lcp array gives us congruent chains of instructions within the
> module.
>
> * Candidate Analysis:
>
> A verification step splits candidates that have different internal input
> sequences or incompatible parent function attributes between occurrences.
> An example of incompatible internal inputs sequences is:
>
> X = W + 6;   vs    X = W + 6;
>
> Y = X + 4;            Y = W + 4;
>
> The above two occurrences would need special control flow to exist within
> the same outlined function.
>
> During analysis candidates have their inputs and outputs computed along
> with an estimated benefit from extraction. During input calculation a
> constant folding step removes all inputs that are the same amongst all
> occurrences.
>
> * Candidate Pruning:
>
> Overlapping candidates are pruned with a generic greedy algorithm that
> picks occurrences starting from the most beneficial candidates.
>
> * Candidate Outlining:
>
> Non pruned candidates are then outlined. Outputs from a candidate are
> returned via an output parameter, except for one output that is promoted to
> a return value. During outlining the inputs into the candidate are
> condensed by computing the equivalencies between the arguments at each
> occurrence. An example of this is:
>
> outlinedFn(1,6,1);  ->  outlinedFn(1,6);
>
> outlinedFn(2,4,2);  ->  outlinedFn(2,4);
>
> In the above, parameters 1 and 3 were found to be equivalent for all
> occurrences, thus the amount of inputs was reduced to 2.
>
> * Debug Info:
>
> Debug information is preserved for the calls to functions which have been
> outlined but all debug info from the original outlined portions is removed,
> making them harder to debug.
>
> * Profile Info:
>
> If the pass is running at Os the outliner will only consider cold blocks,
> whereas Oz considers all blocks that are not marked as hot.
>
> * Location in Pipeline:
>
> The pass is currently configured to run very late in the optimization
> pipeline. It is intended to run at Oz but will also run at Os if there is
> profile data available. The pass can optionally be run twice, once before
> function simplification and then again at the default location. This run is
> optional because you are gambling the potential benefits of redundancy
> elimination vs the potential benefits from function simplification. This
> can lead to large benefits or regressions depending on the benchmark
> (though the benefits tend to outnumber the regressions). The performance
> section contains data for both on a variety of benchmarks.
>
> -- Why at the IR level --
>
> The decision to place the outliner at the IR level comes from a couple of
> major factors:
>
> - Desire to be target independent
>
> - Most opportunities for congruency
>
> The downside to having this type of transformation be at the IR level is
> it means there will be less accuracy in the cost model -  we can somewhat
> accurately model the cost per instruction but we can’t get information on
> how a window of instructions may lower. This can cause regressions
> depending on the platform/codebase, therefore to help alleviate this there
> are several tunable parameters for the cost model.
>
> -- Performance --
>
> More results including clang, llvm-tblgen, and more specific numbers about
> benefits/regressions can be found in the notes section below.
>
> * Size Reduction:
>
> - Test Suite(X86_64):
>
>    - Early+Late outlining provides a geomean of 10.5% reduction over clang
> Oz, with a largest improvement of ~67% and largest regression of ~7.5%.
>
>    - Late outlining provides a geomean of 4.65% reduction, with a largest
> improvement of ~51% and largest regression of ~6.4%.
>
> - Spec 2006(X86_64)
>
>    - Early+Late outlining provides a geomean reduction of 2.08%.
>
>    - Late outlining provides 2.09%.
>
> - CSiBE(AArch64)
>
>    - Early+Late outlining shows geomean reduction of around 3.5%.
>
>    - Late outlining shows 3.1%.
>
> * Compile Time:
>
> Compile time was tested under test-suite with a multisample of 5.
>
> - Early+Late outlining
>
>    - Many improvements with > 40% compile time reduction.
>
>    - Few regressions.
>
> - Late outlining
>
>    - Greatest improvement is ~7.8%
>
>    - Greatest regression is ~4% with a difference of <0.02s
>
> Our explanation for the compile time reduction during early outlining is
> that due to the amount of redundancy reduction early in the optimization
> pipeline there is a reduction in the amount of instruction processing
> during the rest of the compilation.
>
> * Execution Time:
>
> Ran with a multisample of 5.
>
> - Test Suite:
>
>    - Early+Late outlining has many regressions up to 97%. The greatest
> improvement was around 7.5%.
>
>    - Late outlining also has several regressions up to 44% and a greatest
> improvement of around 5.3%.
>
> - Spec:
>
>    - Early+Late has a geomean regression of 3.5%.
>
>    - Late outlining has a geomean regression of 1.25%.
>
> The execution time results are to be expected given that the outliner,
> without profile data, will extract from whatever region it deems
> profitable. Extracting from the hot path can lead to a noticeable
> performance regression on any platform, which can be somewhat avoided by
> providing profile data during outlining.
>
> -- Tested Improvements --
>
> * MergeFunctions:
>
>    - No noticeable benefit.
>
> * LTO:
>
>    - LTO doesn’t have a code size pipeline, but %reductions over LTO are
> comparable to non LTO.
>
> * Input/Output Partitioning:
>
>    -This identifies inputs/outputs that may be folded by splitting a
> candidate. The benefit is minimal for the computations required.
>
> * Similar Candidate Merging:
>
>    - The benefit to be gained is currently not worth the large complexity
> required to catch the desired cases.
>
> -- Potential Improvements --
>
> * Suffix&LCP array construction: The pass currently uses a very basic
> implementation that could be improved. There are definitely faster
> algorithms and some can construct both simultaneously. We will investigate
> this as a potential benefit for compile time in the future.
>
> * Os performance tuning: Under -Os the pass currently only runs on cold
> blocks. Ideally we could expand this to be a little more lax on less
> frequently executed blocks that aren’t cold.
>
> * Candidate Selection: The algorithm currently focuses on the longest
> common sequences. More checks could be added to see if shortening the
> sequence length produces a larger benefit(e.g less inputs/outputs). This
> would likely lead to an increase in compile time but should remain less
> than the baseline.
>
> * Non Exact Functions: The outliner currently ignores functions that do
> not have an exact definition.
>
> -- --
>
> * CSiBE(Code Size Benchmark):
>
> www.csibe.org
>
> * More detailed performance data:
>
> goo.gl/5k6wsP
>
> * Implementation:
>
> https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/
> CodeSizeOutliner.cpp
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170721/e82cbbc5/attachment-0001.html>
Andrey Bokhanko via llvm-dev
2017-Jul-22  22:23 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hi River,
Very impressive! -- thanks for working on this.
A few questions, if you don't mind.
First, on results (from goo.gl/5k6wsP). Some of them are quite surprising.
In theory, "top improvements" should be quite similar in all three
approaches ("Early&Late Outlining", "Late Outlining" and
"Machine
Outliner"), with E&LO capturing most of the cases. Yet, they are very
different:
Test Suite, top improvements:
E&LO:
   -
   enc-3des: 66.31%
   -
   StatementReordering-dbl: 51.45%
   -
   Symbolics-dbl: 51.42%
   -
   Recurrences-dbl: 51.38%
   -
   Packing-dbl: 51.33%
LO:
   -
   enc-3des: 50.7%
   -
   ecbdes: 46.27%
   -
   security-rjindael:45.13%
   -
   ControlFlow-flt: 25.79%
   -
   ControlFlow-dbl: 25.74%
MO:
   -
   ecbdes: 28.22%
   -
   Expansion-flt: 22.56%
   -
   Recurrences-flt: 22.19%
   -
   StatementReordering-flt: 22.15%
   -
   Searching-flt: 21.96%
SPEC, top improvements:
E&LO:
   -
   bzip2: 9.15%
   -
   gcc: 4.03%
   -
   sphinx3: 3.8%
   -
   H264ref: 3.24%
   -
   Perlbench: 3%
LO:
   -
   bzip2: 7.27%
   -
   sphinx3: 3.65%
   -
   Namd: 3.08%
   -
   Gcc: 3.06%
   -
   H264ref: 3.05%
MO:
   -
   Namd: 7.8%
   -
   bzip2: 7.27%
   -
   libquantum: 2.99%
   -
   h264ref: 2%
Do you understand why so?
I'm especially interested in cases where MO managed to find redundancies
while E&O+LO didn't. For example, 2.99% on libquantum (or is it simply
below "top 5 results" for E&O+LO?) -- did you investigated this?
Also, it would be nice to specify full options list for SPEC (I assume SPEC
CPU2006?), similar to how results are reported on spec.org.
And a few questions on the RFC:
On Fri, Jul 21, 2017 at 12:47 AM, River Riddle via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> * Debug Info:
>
Debug information is preserved for the calls to functions which have
been> outlined but all debug info from the original outlined portions is removed,
> making them harder to debug.
>
Just to check I understand it correctly: you remove *all* debug info in
outlined functions, essentially making them undebuggable -- correct? Did
you considered copying debug info from one of outlined fragments instead?
-- at least line numbers?
The execution time results are to be expected given that the
outliner,> without profile data, will extract from whatever region it deems
> profitable. Extracting from the hot path can lead to a noticeable
> performance regression on any platform, which can be somewhat avoided by
> providing profile data during outlining.
>
Some of regressions are quite severe. It would be interesting to implement
what you stated above and measure -- both code size reductions and
performance degradations -- again.
> * LTO:
>
>    - LTO doesn’t have a code size pipeline, but %reductions over LTO are
> comparable to non LTO.
>
LTO is known to affect code size significantly (for example, by removing
redundant functions), so I'm frankly quite surprised that the results are
the same...
Yours,
Andrey
==Compiler Architect
NXP
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170723/96fe9492/attachment.html>
River Riddle via llvm-dev
2017-Jul-22  23:05 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hi Andrey, Questions and feedback are very much welcome. - The explanation as to why the improvements can vary between the IR and MIR outliner mainly boil down to the level of abstraction that each are working at. The MIR level has very accurate heuristics and is effectively the last post ISel target independent code gen pass. The IR outliner on the other hand has more estimation in the cost model, and can affect the decisions of function simplification passes, instruction selection, RA, etc. Taking this into account can lead to different results. Its important to remember the differences that being at a higher abstraction level can cause. - As for the spec(it is 2006, sorry I missed that) command line options, as well as any other benchmark, I only added "-Oz -mno-red-zone(to keep comparisons fair) -(whatever enables each transform)" to the default flags needed for compilation. I'll try to get the exact command line options used and add them. - Debug information(line numbers) is currently only kept if it is the same across all occurrences. This was simply a design choice and can be changed if keeping one instance is the desired behavior. - The behavior described with the profile data is already implemented, I will update the numbers to include the results after including pgo data. - The LTO results are very experimental given that there isn't a size pipeline for LTO yet(there should be). The %improvements can be similar to non LTO but because the LTO binary is generally much smaller the actual decrease in size is also much smaller. I'll add more detailed LTO numbers as well. Thanks, River Riddle On Sat, Jul 22, 2017 at 3:23 PM, Andrey Bokhanko <andreybokhanko at gmail.com> wrote:> Hi River, > > Very impressive! -- thanks for working on this. > > A few questions, if you don't mind. > > First, on results (from goo.gl/5k6wsP). Some of them are quite > surprising. In theory, "top improvements" should be quite similar in all > three approaches ("Early&Late Outlining", "Late Outlining" and "Machine > Outliner"), with E&LO capturing most of the cases. Yet, they are very > different: > > Test Suite, top improvements: > E&LO: > > - > > enc-3des: 66.31% > - > > StatementReordering-dbl: 51.45% > - > > Symbolics-dbl: 51.42% > - > > Recurrences-dbl: 51.38% > - > > Packing-dbl: 51.33% > > LO: > > - > > enc-3des: 50.7% > - > > ecbdes: 46.27% > - > > security-rjindael:45.13% > - > > ControlFlow-flt: 25.79% > - > > ControlFlow-dbl: 25.74% > > MO: > > - > > ecbdes: 28.22% > - > > Expansion-flt: 22.56% > - > > Recurrences-flt: 22.19% > - > > StatementReordering-flt: 22.15% > - > > Searching-flt: 21.96% > > > SPEC, top improvements: > E&LO: > > - > > bzip2: 9.15% > - > > gcc: 4.03% > - > > sphinx3: 3.8% > - > > H264ref: 3.24% > - > > Perlbench: 3% > > LO: > > - > > bzip2: 7.27% > - > > sphinx3: 3.65% > - > > Namd: 3.08% > - > > Gcc: 3.06% > - > > H264ref: 3.05% > > MO: > > - > > Namd: 7.8% > - > > bzip2: 7.27% > - > > libquantum: 2.99% > - > > h264ref: 2% > > > Do you understand why so? > > I'm especially interested in cases where MO managed to find redundancies > while E&O+LO didn't. For example, 2.99% on libquantum (or is it simply > below "top 5 results" for E&O+LO?) -- did you investigated this? > > Also, it would be nice to specify full options list for SPEC (I assume > SPEC CPU2006?), similar to how results are reported on spec.org. > > And a few questions on the RFC: > > On Fri, Jul 21, 2017 at 12:47 AM, River Riddle via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> * Debug Info: >> > Debug information is preserved for the calls to functions which have been >> outlined but all debug info from the original outlined portions is removed, >> making them harder to debug. >> > > Just to check I understand it correctly: you remove *all* debug info in > outlined functions, essentially making them undebuggable -- correct? Did > you considered copying debug info from one of outlined fragments instead? > -- at least line numbers? > > The execution time results are to be expected given that the outliner, >> without profile data, will extract from whatever region it deems >> profitable. Extracting from the hot path can lead to a noticeable >> performance regression on any platform, which can be somewhat avoided by >> providing profile data during outlining. >> > > Some of regressions are quite severe. It would be interesting to implement > what you stated above and measure -- both code size reductions and > performance degradations -- again. > > >> * LTO: >> >> - LTO doesn’t have a code size pipeline, but %reductions over LTO are >> comparable to non LTO. >> > > LTO is known to affect code size significantly (for example, by removing > redundant functions), so I'm frankly quite surprised that the results are > the same... > > Yours, > Andrey > ==> Compiler Architect > NXP > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170722/31e66d68/attachment-0001.html>
Jessica Paquette via llvm-dev
2017-Jul-24  18:12 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hi River, I’m working on the MachineOutliner pass at the MIR level. Working at the IR level sounds interesting! It also seems like our algorithms are similar. I was thinking of taking the suffix array route with the MachineOutliner in the future. Anyway, I’d like to ask about this:> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > The downside to having this type of transformation be at the IR level is it means there will be less accuracy in the cost model - we can somewhat accurately model the cost per instruction but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, therefore to help alleviate this there are several tunable parameters for the cost model.The inliner is threshold-based and it can be rather unpredictable how it will impact the code size of a program. Do you have any idea as to how heuristics/thresholds/parameters could be tuned to prevent this? In my experience, making good code size decisions with these sorts of passes requires a lot of knowledge about what instructions you’re dealing with exactly. I’ve seen the inliner cause some pretty serious code size regressions in projects due to small changes to the cost model/parameters which cause improvements in other projects. I’m a little worried that an IR-level outliner for code size would be doomed to a similar fate. Perhaps it would be interesting to run this sort of pass pre-register allocation? This would help pull you away from having to use heuristics, but give you some more opportunities for finding repeated instruction sequences. I’ve thought of doing something like this in the future with the MachineOutliner and seeing how it goes. - Jessica -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/d8aa94d0/attachment.html>
River Riddle via llvm-dev
2017-Jul-24  18:55 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hi Jessica, The comparison to the inliner is an interesting one but we think it's important to note the difference in the use of heuristics. The inliner is juggling many different tasks at the same time, execution speed, code size, etc. which can cause the parameters to be very sensitive depending on the benchmark/platform/etc. The outliners heuristics are focused solely on the potential code size savings from outlining, and is thus only sensitive to the current platform. This only creates a problem when we are over estimating the potential cost of a set of instructions for a particular target. The cost model parameters are only minimums: instruction sequence length, estimated benefit, occurrence amount. The heuristics themselves are conservative and based upon all of the target information available at the IR level, the parameters are just setting a lower bound to weed out any outliers. You are correct in that being at the machine level, before or after RA, will give the most accurate heuristics but we feel there's an advantage to being at the IR level. At the IR level we can do so many more things that are either too difficult/complex for the machine level(e.g parameterization/outputs/etc). Not only can we do these things but they are available on all targets immediately, without the need for target hooks. The caution on the use of heuristics is understandable, but there comes a point when trade offs need to be made. We made the trade off for a loss in exact cost modeling to gain flexibility, coverage, and potential for further features. This trade off is the same made for quite a few IR level optimizations, including inlining. As for the worry about code size regressions, so far the results seem to support our hypothesis. Thanks, River Riddle On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at apple.com> wrote:> > Hi River, > > I’m working on the MachineOutliner pass at the MIR level. Working at the > IR level sounds interesting! It also seems like our algorithms are similar. > I was thinking of taking the suffix array route with the MachineOutliner in > the future. > > Anyway, I’d like to ask about this: > > On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > The downside to having this type of transformation be at the IR level is > it means there will be less accuracy in the cost model - we can somewhat > accurately model the cost per instruction but we can’t get information on > how a window of instructions may lower. This can cause regressions > depending on the platform/codebase, therefore to help alleviate this there > are several tunable parameters for the cost model. > > > The inliner is threshold-based and it can be rather unpredictable how it > will impact the code size of a program. Do you have any idea as to how > heuristics/thresholds/parameters could be tuned to prevent this? In my > experience, making good code size decisions with these sorts of passes > requires a lot of knowledge about what instructions you’re dealing with > exactly. I’ve seen the inliner cause some pretty serious code size > regressions in projects due to small changes to the cost model/parameters > which cause improvements in other projects. I’m a little worried that an > IR-level outliner for code size would be doomed to a similar fate. > > Perhaps it would be interesting to run this sort of pass pre-register > allocation? This would help pull you away from having to use heuristics, > but give you some more opportunities for finding repeated instruction > sequences. I’ve thought of doing something like this in the future with the > MachineOutliner and seeing how it goes. > > - Jessica > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/73a8be56/attachment.html>
Sean Silva via llvm-dev
2017-Jul-25  04:17 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
On Thu, Jul 20, 2017 at 3:47 PM, River Riddle via llvm-dev < llvm-dev at lists.llvm.org> wrote:> I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been > working on an interprocedural outlining (code folding) pass for code size > improvement at the IR level. We hit a couple of use cases that the current > code size solutions didn’t handle well enough. Outlining is one of the > avenues that seemed potentially beneficial. > > -- Algorithmic Approach -- > > The general implementation can be explained in stages: > > * Candidate Selection: > > Each instruction in the module is mapped to a congruence class based upon > relaxed equivalency constraints. For most instructions this is simply: The > type, opcode, and operand types. The constraints are tightened for > instructions with special state that require more exact equivalence (e.g. > ShuffleVector requires a constant mask vector). Candidates are then found > by constructing a suffix array and lcp(longest common prefix) array. > Walking the lcp array gives us congruent chains of instructions within the > module. > > * Candidate Analysis: > > A verification step splits candidates that have different internal input > sequences or incompatible parent function attributes between occurrences. > An example of incompatible internal inputs sequences is: > > X = W + 6; vs X = W + 6; > > Y = X + 4; Y = W + 4; > > The above two occurrences would need special control flow to exist within > the same outlined function. > > During analysis candidates have their inputs and outputs computed along > with an estimated benefit from extraction. During input calculation a > constant folding step removes all inputs that are the same amongst all > occurrences. > > * Candidate Pruning: > > Overlapping candidates are pruned with a generic greedy algorithm that > picks occurrences starting from the most beneficial candidates. > > * Candidate Outlining: > > Non pruned candidates are then outlined. Outputs from a candidate are > returned via an output parameter, except for one output that is promoted to > a return value. During outlining the inputs into the candidate are > condensed by computing the equivalencies between the arguments at each > occurrence. An example of this is: > > outlinedFn(1,6,1); -> outlinedFn(1,6); > > outlinedFn(2,4,2); -> outlinedFn(2,4); > > In the above, parameters 1 and 3 were found to be equivalent for all > occurrences, thus the amount of inputs was reduced to 2. > > * Debug Info: > > Debug information is preserved for the calls to functions which have been > outlined but all debug info from the original outlined portions is removed, > making them harder to debug. > > * Profile Info: > > If the pass is running at Os the outliner will only consider cold blocks, > whereas Oz considers all blocks that are not marked as hot. > > * Location in Pipeline: > > The pass is currently configured to run very late in the optimization > pipeline. It is intended to run at Oz but will also run at Os if there is > profile data available. The pass can optionally be run twice, once before > function simplification and then again at the default location. This run is > optional because you are gambling the potential benefits of redundancy > elimination vs the potential benefits from function simplification. This > can lead to large benefits or regressions depending on the benchmark > (though the benefits tend to outnumber the regressions). The performance > section contains data for both on a variety of benchmarks. > > -- Why at the IR level -- > > The decision to place the outliner at the IR level comes from a couple of > major factors: > > - Desire to be target independent > > - Most opportunities for congruency > > The downside to having this type of transformation be at the IR level is > it means there will be less accuracy in the cost model - we can somewhat > accurately model the cost per instruction but we can’t get information on > how a window of instructions may lower. This can cause regressions > depending on the platform/codebase, therefore to help alleviate this there > are several tunable parameters for the cost model. > > -- Performance -- > > More results including clang, llvm-tblgen, and more specific numbers about > benefits/regressions can be found in the notes section below. > > * Size Reduction: > > - Test Suite(X86_64): > > - Early+Late outlining provides a geomean of 10.5% reduction over clang > Oz, with a largest improvement of ~67% and largest regression of ~7.5%. > > - Late outlining provides a geomean of 4.65% reduction, with a largest > improvement of ~51% and largest regression of ~6.4%. > > - Spec 2006(X86_64) > > - Early+Late outlining provides a geomean reduction of 2.08%. > > - Late outlining provides 2.09%. > > - CSiBE(AArch64) > > - Early+Late outlining shows geomean reduction of around 3.5%. > > - Late outlining shows 3.1%. >It would be good to visualize these results. Maybe a bar chart like https://goo.gl/qN2HqA from http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html for SPEC?> > * Compile Time: > > Compile time was tested under test-suite with a multisample of 5. > > - Early+Late outlining > > - Many improvements with > 40% compile time reduction. > > - Few regressions. > > - Late outlining > > - Greatest improvement is ~7.8% > > - Greatest regression is ~4% with a difference of <0.02s > > Our explanation for the compile time reduction during early outlining is > that due to the amount of redundancy reduction early in the optimization > pipeline there is a reduction in the amount of instruction processing > during the rest of the compilation. > > * Execution Time: > > Ran with a multisample of 5. > > - Test Suite: > > - Early+Late outlining has many regressions up to 97%. The greatest > improvement was around 7.5%. > > - Late outlining also has several regressions up to 44% and a greatest > improvement of around 5.3%. > > - Spec: > > - Early+Late has a geomean regression of 3.5%. > > - Late outlining has a geomean regression of 1.25%. > > The execution time results are to be expected given that the outliner, > without profile data, will extract from whatever region it deems > profitable. Extracting from the hot path can lead to a noticeable > performance regression on any platform, which can be somewhat avoided by > providing profile data during outlining. > > -- Tested Improvements -- > > * MergeFunctions: > > - No noticeable benefit. > > * LTO: > > - LTO doesn’t have a code size pipeline, but %reductions over LTO are > comparable to non LTO. >-Os/-Oz are communicated through the optsize and minsize attributes. There isn't a specific code size pipeline per se (I think this is true for per-TU compilation as well, though I would have to check). Also, can you clarify what you mean by "LTO"? I assume this means that the outliner did not run during per-TU compilation but did run on the combined FullLTO bitcode, but want to check to be sure. -- Sean Silva> * Input/Output Partitioning: > > -This identifies inputs/outputs that may be folded by splitting a > candidate. The benefit is minimal for the computations required. > > * Similar Candidate Merging: > > - The benefit to be gained is currently not worth the large complexity > required to catch the desired cases. > > -- Potential Improvements -- > > * Suffix&LCP array construction: The pass currently uses a very basic > implementation that could be improved. There are definitely faster > algorithms and some can construct both simultaneously. We will investigate > this as a potential benefit for compile time in the future. > > * Os performance tuning: Under -Os the pass currently only runs on cold > blocks. Ideally we could expand this to be a little more lax on less > frequently executed blocks that aren’t cold. > > * Candidate Selection: The algorithm currently focuses on the longest > common sequences. More checks could be added to see if shortening the > sequence length produces a larger benefit(e.g less inputs/outputs). This > would likely lead to an increase in compile time but should remain less > than the baseline. > > * Non Exact Functions: The outliner currently ignores functions that do > not have an exact definition. > > -- -- > > * CSiBE(Code Size Benchmark): > > www.csibe.org > > * More detailed performance data: > > goo.gl/5k6wsP > > * Implementation: > > https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/ > CodeSizeOutliner.cpp > > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/6112d84e/attachment.html>
River Riddle via llvm-dev
2017-Jul-25  04:25 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hey Sean, The bit about the attributes is good to know. When LTO is enabled the early run will still run per-TU but the late run will be shifted to work on the full LTO bitcode. Also I don't have specific numbers on how often parameterization is utilized but I can assure you that it's a majority of the time. I'll look into transforming the data into a visual format since its in google docs anyways. Thanks, River Riddle On Mon, Jul 24, 2017 at 9:17 PM Sean Silva <chisophugis at gmail.com> wrote:> On Thu, Jul 20, 2017 at 3:47 PM, River Riddle via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve >> been working on an interprocedural outlining (code folding) pass for code >> size improvement at the IR level. We hit a couple of use cases that the >> current code size solutions didn’t handle well enough. Outlining is one of >> the avenues that seemed potentially beneficial. >> >> -- Algorithmic Approach -- >> >> The general implementation can be explained in stages: >> >> * Candidate Selection: >> >> Each instruction in the module is mapped to a congruence class based upon >> relaxed equivalency constraints. For most instructions this is simply: The >> type, opcode, and operand types. The constraints are tightened for >> instructions with special state that require more exact equivalence (e.g. >> ShuffleVector requires a constant mask vector). Candidates are then found >> by constructing a suffix array and lcp(longest common prefix) array. >> Walking the lcp array gives us congruent chains of instructions within the >> module. >> >> * Candidate Analysis: >> >> A verification step splits candidates that have different internal input >> sequences or incompatible parent function attributes between occurrences. >> An example of incompatible internal inputs sequences is: >> >> X = W + 6; vs X = W + 6; >> >> Y = X + 4; Y = W + 4; >> >> The above two occurrences would need special control flow to exist within >> the same outlined function. >> >> During analysis candidates have their inputs and outputs computed along >> with an estimated benefit from extraction. During input calculation a >> constant folding step removes all inputs that are the same amongst all >> occurrences. >> >> * Candidate Pruning: >> >> Overlapping candidates are pruned with a generic greedy algorithm that >> picks occurrences starting from the most beneficial candidates. >> >> * Candidate Outlining: >> >> Non pruned candidates are then outlined. Outputs from a candidate are >> returned via an output parameter, except for one output that is promoted to >> a return value. During outlining the inputs into the candidate are >> condensed by computing the equivalencies between the arguments at each >> occurrence. An example of this is: >> >> outlinedFn(1,6,1); -> outlinedFn(1,6); >> >> outlinedFn(2,4,2); -> outlinedFn(2,4); >> >> In the above, parameters 1 and 3 were found to be equivalent for all >> occurrences, thus the amount of inputs was reduced to 2. >> >> * Debug Info: >> >> Debug information is preserved for the calls to functions which have been >> outlined but all debug info from the original outlined portions is removed, >> making them harder to debug. >> >> * Profile Info: >> >> If the pass is running at Os the outliner will only consider cold blocks, >> whereas Oz considers all blocks that are not marked as hot. >> >> * Location in Pipeline: >> >> The pass is currently configured to run very late in the optimization >> pipeline. It is intended to run at Oz but will also run at Os if there is >> profile data available. The pass can optionally be run twice, once before >> function simplification and then again at the default location. This run is >> optional because you are gambling the potential benefits of redundancy >> elimination vs the potential benefits from function simplification. This >> can lead to large benefits or regressions depending on the benchmark >> (though the benefits tend to outnumber the regressions). The performance >> section contains data for both on a variety of benchmarks. >> >> -- Why at the IR level -- >> >> The decision to place the outliner at the IR level comes from a couple of >> major factors: >> >> - Desire to be target independent >> >> - Most opportunities for congruency >> >> The downside to having this type of transformation be at the IR level is >> it means there will be less accuracy in the cost model - we can somewhat >> accurately model the cost per instruction but we can’t get information on >> how a window of instructions may lower. This can cause regressions >> depending on the platform/codebase, therefore to help alleviate this there >> are several tunable parameters for the cost model. >> >> -- Performance -- >> >> More results including clang, llvm-tblgen, and more specific numbers >> about benefits/regressions can be found in the notes section below. >> >> * Size Reduction: >> >> - Test Suite(X86_64): >> >> - Early+Late outlining provides a geomean of 10.5% reduction over >> clang Oz, with a largest improvement of ~67% and largest regression of >> ~7.5%. >> >> - Late outlining provides a geomean of 4.65% reduction, with a largest >> improvement of ~51% and largest regression of ~6.4%. >> >> - Spec 2006(X86_64) >> >> - Early+Late outlining provides a geomean reduction of 2.08%. >> >> - Late outlining provides 2.09%. >> >> - CSiBE(AArch64) >> >> - Early+Late outlining shows geomean reduction of around 3.5%. >> >> - Late outlining shows 3.1%. >> > > > It would be good to visualize these results. Maybe a bar chart like > https://goo.gl/qN2HqA from http://blog.llvm.org/ > 2016/06/thinlto-scalable-and-incremental-lto.html for SPEC? > > >> >> * Compile Time: >> >> Compile time was tested under test-suite with a multisample of 5. >> >> - Early+Late outlining >> >> - Many improvements with > 40% compile time reduction. >> >> - Few regressions. >> >> - Late outlining >> >> - Greatest improvement is ~7.8% >> >> - Greatest regression is ~4% with a difference of <0.02s >> >> Our explanation for the compile time reduction during early outlining is >> that due to the amount of redundancy reduction early in the optimization >> pipeline there is a reduction in the amount of instruction processing >> during the rest of the compilation. >> >> * Execution Time: >> >> Ran with a multisample of 5. >> >> - Test Suite: >> >> - Early+Late outlining has many regressions up to 97%. The greatest >> improvement was around 7.5%. >> >> - Late outlining also has several regressions up to 44% and a greatest >> improvement of around 5.3%. >> >> - Spec: >> >> - Early+Late has a geomean regression of 3.5%. >> >> - Late outlining has a geomean regression of 1.25%. >> >> The execution time results are to be expected given that the outliner, >> without profile data, will extract from whatever region it deems >> profitable. Extracting from the hot path can lead to a noticeable >> performance regression on any platform, which can be somewhat avoided by >> providing profile data during outlining. >> >> -- Tested Improvements -- >> >> * MergeFunctions: >> >> - No noticeable benefit. >> >> * LTO: >> >> - LTO doesn’t have a code size pipeline, but %reductions over LTO are >> comparable to non LTO. >> > > -Os/-Oz are communicated through the optsize and minsize attributes. There > isn't a specific code size pipeline per se (I think this is true for per-TU > compilation as well, though I would have to check). > > Also, can you clarify what you mean by "LTO"? I assume this means that the > outliner did not run during per-TU compilation but did run on the combined > FullLTO bitcode, but want to check to be sure. > > -- Sean Silva > > >> * Input/Output Partitioning: >> >> -This identifies inputs/outputs that may be folded by splitting a >> candidate. The benefit is minimal for the computations required. >> >> * Similar Candidate Merging: >> >> - The benefit to be gained is currently not worth the large complexity >> required to catch the desired cases. >> >> -- Potential Improvements -- >> >> * Suffix&LCP array construction: The pass currently uses a very basic >> implementation that could be improved. There are definitely faster >> algorithms and some can construct both simultaneously. We will investigate >> this as a potential benefit for compile time in the future. >> >> * Os performance tuning: Under -Os the pass currently only runs on cold >> blocks. Ideally we could expand this to be a little more lax on less >> frequently executed blocks that aren’t cold. >> >> * Candidate Selection: The algorithm currently focuses on the longest >> common sequences. More checks could be added to see if shortening the >> sequence length produces a larger benefit(e.g less inputs/outputs). This >> would likely lead to an increase in compile time but should remain less >> than the baseline. >> >> * Non Exact Functions: The outliner currently ignores functions that do >> not have an exact definition. >> >> -- -- >> >> * CSiBE(Code Size Benchmark): >> >> www.csibe.org >> >> * More detailed performance data: >> >> goo.gl/5k6wsP >> >> * Implementation: >> >> https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/ >> CodeSizeOutliner.cpp >> >> >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/6c427e9d/attachment.html>
Possibly Parallel Threads
- [RFC] Add IR level interprocedural outliner for code size.
- [LLVMdev] [RFC] AArch64: Should we disable GlobalMerge?
- [LLVMdev] IR Passes and TargetTransformInfo: Straw Man
- [RFC] Add IR level interprocedural outliner for code size.
- [LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives