Jessica Paquette via llvm-dev
2016-Aug-26 21:26 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Hi everyone,

Since I haven't said anything on the mailing list before, a quick introduction. I'm an intern at Apple, and over the summer I implemented a prototype of an outlining pass for code size in LLVM, which I'm now seeking to eventually upstream. I have the preliminary code on GitHub right now, but it's still very prototypical (see the Code section).

===============================
Motivation
===============================
The goal of the internship was to create an interprocedural pass that would reduce code size as much as possible, perhaps at the cost of some performance. This would be useful to, say, embedded programmers who only have a few kilobytes to work with and a substantial amount of code to fit in that space.

===============================
Approach and Initial Results
===============================
To do this, we chose to implement an outliner. Outliners find sequences of instructions which would be better off as a function call, by some measure of "better". In this case, the measure of "better" is "makes code smaller".

===============================
Results
===============================
These results compare a version of Clang equipped with the outliner prototype against an equivalent version of Clang without the outliner, on a fairly recent 64-bit Intel processor.

CODE SIZE REDUCTION
For tests >= 4 KB in non-outlined size, the outliner currently provides an average code size reduction of 12.94% on the LLVM test suite in comparison to a default Clang, up to a maximum of 51.4%. In comparison to Clang with -Oz, the outliner provides an average code size reduction of 1.29%, up to a maximum of 37%. I believe that the -Oz numbers can be further improved by better tuning the outlining cost model.

EXECUTION TIME IMPACT
On average, the outliner increases execution time by 2% on the LLVM test suite, but it has also been shown to improve execution time by up to 16%.
These results were gathered on a fairly recent Intel processor, so they may vary elsewhere. Recent Intel processors have very low latency for function calls, which may influence these numbers. Execution time impact is likely dependent on the latency of function calls, instruction caching behaviour, and the execution frequency of the code being outlined. In particular, a processor with high function call latency will likely see more execution time overhead.

===============================
Implementation
===============================
The outliner, in its current state, is a MachineModulePass. It finds *identical* sequences of MIR, after register allocation, and pulls them out into their own functions. Thus, it's effectively assembly-level. Ultimately, the algorithm used is general, so it can sit anywhere, but MIR was very convenient for the time being.

It requires two data structures:

1. A generalized suffix tree
2. A "terminated string"

1: The generalized suffix tree is constructed using Ukkonen's linear-time construction algorithm [1]. Suffix trees require linear space and support linear-time substring queries. In practice, constructing the suffix tree is the most time-consuming part, but I haven't noticed a difference in compilation time on, say, 12 MB .ll files.

2: To support the suffix tree, we need a "terminated string": a generalized string with a unique terminator character appended to the end. TerminatedStrings can be built from any type.

The algorithm is then roughly as follows:

1. For each MachineBasicBlock in the program, build a TerminatedString for that block.
2. Build a suffix tree for that collection of strings.
3. Query the suffix tree for the longest repeated substring and place that string in a candidate list. Repeat until none are found.
4. Create a function for each candidate.
5. Replace each candidate with a call to its respective function.
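As an illustration of steps 3-5, here is a minimal sketch (not the actual pass) of the candidate search over a "proxy string" of integers. It deliberately uses a quadratic scan in place of the suffix tree, so it only shows *what* the tree is asked for, not how it answers in linear time; the function name and the integer encoding are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Find the longest substring that occurs at least twice in the proxy
// string, returning its start index and length (length 0 if none).
// The real implementation would answer this query from the generalized
// suffix tree; this quadratic scan exists only to show the candidate
// search performed by steps 3-5 of the algorithm.
static std::pair<std::size_t, std::size_t>
longestRepeatedSubstring(const std::vector<unsigned> &S) {
  std::size_t BestStart = 0, BestLen = 0;
  for (std::size_t I = 0; I < S.size(); ++I) {
    for (std::size_t J = I + 1; J < S.size(); ++J) {
      // Length of the common prefix of the suffixes starting at I and J.
      std::size_t K = 0;
      while (J + K < S.size() && S[I + K] == S[J + K])
        ++K;
      if (K > BestLen) {
        BestLen = K;
        BestStart = I;
      }
    }
  }
  return {BestStart, BestLen};
}
```

For a proxy string like {1, 2, 3, 9, 1, 2, 3, 7}, the sketch reports the repeat {1, 2, 3}: in the pass, that repeated run would become an outlined function and each occurrence a call to it.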
Currently, the program itself isn't stored in the suffix tree, but rather a "proxy string" of integers. This isn't necessary at the MIR level, but may be for an IR-level extension of the algorithm.

===============================
Challenges
===============================
1) MEMORY CONSUMPTION
Given a string of length n, a naive suffix tree implementation can take up to 40n bytes of memory. However, this number can be reduced to 20n with a bit of work [2]. Currently, the suffix tree stores the entire program, including instructions which really ought not to be outlined, such as returns. These instructions should not be included in the final implementation; instead, they should act as terminators for the strings. This will likely curb memory consumption. Suffix trees have been used in the past in sliding-window-based compression schemes, which may serve as a source of inspiration for reducing memory overhead [3].

Nonetheless, the outliner probably shouldn't be run unless it really has to be. It will likely be used mostly in embedded spaces, where the programs have to fit into small devices anyway, so memory overhead in the compiler likely won't be a problem. The outliner should only be used in -Oz compilations, and possibly should have its own flag.

2) EXECUTION TIME
Currently, the outliner isn't tuned to prevent execution time increases. There is an average execution time hit of 2% on the tests in the LLVM test suite, with a few outliers of up to 30%. The outliers are tests which contain hot loops. The outliner really ought to be able to use profiling information and not outline from hot areas. Another suggestion people have given me is to add a "never outline" directive which would allow the user to say something along the lines of "this is a hot loop, please never outline from it".

It's also important to note that these numbers come from a fairly recent Intel processor.
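Returning to the memory-consumption point above: the idea that instructions like returns should act as terminators can be sketched as splitting the proxy string before any search, so that unoutlinable instructions never land inside a repeated-substring candidate and never need to be stored in the tree. This is a hypothetical helper, not code from the pass, and the use of 0 as the terminator code is an assumption for the example.

```cpp
#include <cassert>
#include <vector>

// Split a proxy string at "terminator" codes (e.g. whatever integer was
// chosen to encode a return). Each resulting segment could then become
// its own TerminatedString, keeping unoutlinable instructions out of
// the suffix tree entirely. The terminator value 0 is an assumption.
static std::vector<std::vector<unsigned>>
splitAtTerminators(const std::vector<unsigned> &S, unsigned Term = 0) {
  std::vector<std::vector<unsigned>> Segments(1);
  for (unsigned C : S) {
    if (C == Term) {
      // Close the current segment; skip empty ones.
      if (!Segments.back().empty())
        Segments.emplace_back();
    } else {
      Segments.back().push_back(C);
    }
  }
  if (Segments.back().empty())
    Segments.pop_back();
  return Segments;
}
```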
3) CONSTRAINTS ON INSTRUCTIONS
The outliner currently won't pull anything out of functions which use a red zone. It also won't pull out anything that uses the stack or the instruction pointer, or that uses constant pool indices, CFI indices, jump table indices, or frame indices. This removes many opportunities for outlining which would likely be available at a higher level (such as IR). Thus, there's a case for moving this up to a higher level.

===============================
Additional Applications
===============================
The suffix tree itself could be used as a tool for finding opportunities to refactor code. For example, it could recognize places where the user likely copied and pasted some code. This could be run on codebases to find areas where people could manually outline things at the source level.

Using the terminated string class, it would also be possible to implement other string algorithms on code. This may open the door to new ways to analyze existing codebases.

===============================
Roadmap
===============================
The current outliner is *very* prototypical. The version I would want to upstream would be a new implementation. Here's what I'd like to address and accomplish:

1. Ask "what does the LLVM community want from an outliner?" and use that to drive development of the algorithm.
2. Reimplement the outliner, perhaps using a less memory-intensive data structure like a suffix array.
3. Begin adding features to the algorithm, for example:
   i. Teaching the algorithm about hot/cold blocks of code and taking that into account.
   ii. Simple parameter passing.
   iii. Similar function outlining -- e.g., noticing that two outlining candidates are similar and can be merged into one function with some control flow.
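For roadmap item 2, a suffix array can answer the same longest-repeated-substring query while storing only a few machine words per character, versus the much larger per-node overhead of a suffix tree. Here is a rough sketch of the idea; it uses a comparison sort for brevity where a real implementation would use an O(n log n) or linear-time construction such as SA-IS, and the function is hypothetical rather than taken from the prototype.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Length of the longest repeated substring, via a suffix array: sort
// all suffixes lexicographically, then the longest repeat is the
// longest common prefix of some *adjacent* pair in sorted order.
static std::size_t longestRepeatLength(const std::vector<unsigned> &S) {
  std::vector<std::size_t> SA(S.size());
  std::iota(SA.begin(), SA.end(), 0);
  // Comparison sort of suffix start indices; a production version
  // would use an O(n log n) or linear-time suffix array construction.
  std::sort(SA.begin(), SA.end(), [&](std::size_t A, std::size_t B) {
    return std::lexicographical_compare(S.begin() + A, S.end(),
                                        S.begin() + B, S.end());
  });
  std::size_t Best = 0;
  for (std::size_t I = 1; I < SA.size(); ++I) {
    // Common prefix length of neighbouring suffixes.
    std::size_t A = SA[I - 1], B = SA[I], K = 0;
    while (A + K < S.size() && B + K < S.size() && S[A + K] == S[B + K])
      ++K;
    Best = std::max(Best, K);
  }
  return Best;
}
```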
===============================
Code
===============================
Note: This code requires MachineModulePasses.

* Main pass:
https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h

* Suffix tree:
https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h

* TerminatedString and TerminatedStringList:
https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h

Here are a couple unit tests for the data structures.

* Suffix tree unit tests:
https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp

* TerminatedString unit tests:
https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp

* TerminatedStringList unit tests:
https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp

===============================
References
===============================
[1] Ukkonen's Algorithm: https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
[2] Suffix Trees and Suffix Arrays: http://web.cs.iastate.edu/~cs548/suffix.pdf
[3] Extended Application of Suffix Trees to Data Compression: http://www.larsson.dogma.net/dccpaper.pdf

Thanks for reading,
Jessica
Hal Finkel via llvm-dev
2016-Aug-26 21:36 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Hi Jessica,

This is quite interesting.

Can you comment on why you started by doing this at the MI level, as opposed to the IR level? And at the MI level, why after RA instead of before RA?

Perhaps the first question is another way of asking about how accurately we can model call-site code size at the IR level.

Thanks in advance,
Hal

[Quoted original message snipped.]

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Kevin Choi via llvm-dev
2016-Aug-26 21:55 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
I think the "Motivation" section explained that. I too first thought "why not at IR?", but the reason looks to be that MIR, post-RA, has the most accurate heuristics (the best way to know looks like actually getting there). Do you know if there is any experimental pass that relies on deriving heuristics by a feedback loop after letting, i.e., a duplicate module/function/block continue past?

Regards,
Kevin

[Quoted messages snipped.]
Philip Reames via llvm-dev
2016-Aug-27 20:38 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Thanks for a well written and informative writeup. This was a pleasure to read and laid out the issues well for consideration. On a technical level, I could see this having a secondary use for developers of the compiler itself. Seeing the same pattern of assembly many times across a large binary is often a hint of a missed optimization. Using this type of technique to generate candidates for further investigation seems interesting. Philip On 08/26/2016 02:26 PM, Jessica Paquette via llvm-dev wrote:> Hi everyone, > > Since I haven't said anything on the mailing list before, a quick > introduction. I'm an intern at Apple, and over the summer I implemented a > prototype for an outlining pass for code size in LLVM. Now I'm seeking to > eventually upstream it. I have the preliminary code on GitHub right now, > but it's still very prototypical (see the code section). > > ===============================> Motivation > ===============================> The goal of the internship was to create an interprocedural pass that > would reduce code size as much as possible, perhaps at the cost of some > performance. This would be useful to, say, embedded programmers who only > have a few kilobytes to work with and a substantial amount of code to fit > in that space. > > > ===============================> Approach and Initial Results > ===============================> To do this, we chose to implement an outliner. Outliners find sequences of > instructions which would be better off as a function call, by some measure > of "better". In this case, the measure of "better" is "makes code > smaller". > > > ===============================> Results > ===============================> These results are from a fairly recent 64-bit Intel processor, using a > version of Clang equipped with the outliner prototype versus an equivalent > version of Clang without the outliner. 
>
> CODE SIZE REDUCTION
> For tests >= 4 Kb in non-outlined size, the outliner currently provides an
> average of 12.94% code size reduction on the LLVM test suite in comparison
> to a default Clang, and up to 51.4% code size reduction. In comparison to a
> Clang with -Oz, the outliner provides an average of a 1.29% code size
> reduction, and up to a 37% code size reduction. I believe that the -Oz
> numbers can be further improved by better tuning of the outlining cost model.
>
> EXECUTION TIME IMPACT
> On average, the outliner increases execution time by 2% on the LLVM test
> suite, but has also been shown to improve execution time by up to 16%.
> These results were from a fairly recent Intel processor, so the results
> may vary. Recent Intel processors have very low latency for function
> calls, which may impact these results. Execution time improvements are
> likely dependent on the latency of function calls, instruction caching
> behaviour, and the execution frequency of the code being outlined. In
> particular, using a processor with high function call latency will likely
> increase execution time overhead.
>
> ===============================
> Implementation
> ===============================
> The outliner, in its current state, is a MachineModulePass. It finds
> *identical* sequences of MIR, after register allocation, and pulls them
> out into their own functions. Thus, it's effectively assembly-level.
> Ultimately, the algorithm used is general, so it can sit anywhere, but MIR
> was very convenient for the time being.
>
> It requires two data structures:
>
> 1. A generalized suffix tree
> 2. A "terminated string"
>
> 1: The generalized suffix tree is constructed using Ukkonen's linear-time
> construction algorithm [1]. It requires linear space and supports
> linear-time substring queries. In practice, constructing the suffix tree
> is the most time-consuming part, but I haven't noticed a difference in
> compilation time on, say, 12 MB .ll files.
>
> 2: To support the suffix tree, we need a "terminated string." This is a
> generalized string with a unique terminator character appended to the
> end. TerminatedStrings can be built from any type.
>
> The algorithm is then roughly as follows:
>
> 1. For each MachineBasicBlock in the program, build a TerminatedString for
> that block.
> 2. Build a suffix tree for that collection of strings.
> 3. Query the suffix tree for the longest repeated substring and place that
> string in a candidate list. Repeat until none are found.
> 4. Create functions for each candidate.
> 5. Replace each candidate with a call to its respective function.
>
> Currently, the program itself isn't stored in the suffix tree, but rather
> a "proxy string" of integers. This isn't necessary at the MIR level, but
> may be for an IR-level extension of the algorithm.
>
> ===============================
> Challenges
> ===============================
> 1) MEMORY CONSUMPTION
> Given a string of length n, a naive suffix tree implementation can take up
> to 40n bytes of memory. However, this number can be reduced to 20n with a
> bit of work [2]. Currently, the suffix tree stores the entire program,
> including instructions which really ought not to be outlined, such as
> returns. These instructions should not be included in the final
> implementation, but should rather act as terminators for the strings. This
> will likely curb memory consumption. Suffix trees have been used in the
> past in sliding-window-based compression schemes, which may serve as a
> source of inspiration for reducing memory overhead [3].
>
> Nonetheless, the outliner probably shouldn't be run unless it really has
> to be run. It will likely be used mostly in embedded spaces, where the
> programs have to fit into small devices anyway. Thus, memory overhead for
> the compiler likely won't be a problem. The outliner should only be used
> in -Oz compilations, and possibly should have its own flag.
>
> 2) EXECUTION TIME
> Currently, the outliner isn't tuned to prevent execution time
> increases. There is an average 2% execution time hit on the tests in
> the LLVM test suite, with a few outliers of up to 30%. The outliers are
> tests which contain hot loops. The outliner really ought to be able to use
> profiling information and not outline from hot areas. Another suggestion
> people have given me is to add a "never outline" directive which would
> allow the user to say something along the lines of "this is a hot loop,
> please never outline from it".
>
> It's also important to note that these numbers come from a fairly
> recent Intel processor.
>
> 3) CONSTRAINTS ON INSTRUCTIONS
> The outliner currently won't pull anything out of functions which use a
> red zone. It also won't pull anything out that uses the stack, the
> instruction pointer, constant pool indices, CFI indices, jump table
> indices, or frame indices. This removes many opportunities for outlining
> which would likely be available at a higher level (such as IR). Thus,
> there's a case for moving this up to a higher level.
>
> ===============================
> Additional Applications
> ===============================
> The suffix tree itself could be used as a tool for finding opportunities
> to refactor code. For example, it could recognize places where the user
> likely copied and pasted some code. This could be run on codebases to find
> areas where people could manually outline things at the source level.
>
> Using the terminated string class, it would also be possible to implement
> other string algorithms on code. This may open the door to new ways to
> analyze existing codebases.
>
> ===============================
> Roadmap
> ===============================
> The current outliner is *very* prototypical. The version I would want to
> upstream would be a new implementation. Here's what I'd like to address
> and accomplish.
>
> 1. Ask "what does the LLVM community want from an outliner?" and use that
> to drive development of the algorithm.
> 2. Reimplement the outliner, perhaps using a less memory-intensive data
> structure, like a suffix array.
> 3. Begin adding features to the algorithm, for example:
>    i. Teaching the algorithm about hot/cold blocks of code and taking
> that into account.
>    ii. Simple parameter passing.
>    iii. Similar-function outlining, e.g., noticing that two outlining
> candidates are similar and can be merged into one function with some
> control flow.
>
> [snip: Code and References sections repeated from earlier in the thread]
>
> Thanks for reading,
> Jessica
Jessica Paquette via llvm-dev
2016-Aug-28 04:11 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
That’s actually something that’s been brought up a few times! I’m thinking of ways to use the suffix-tree-based approach as a tool for that, and maybe for finding areas where code could be refactored. For example, people could find sequences of instructions which appear suspiciously often, and deduce that they are copied-and-pasted pieces of code.

It also may be interesting to investigate other string algorithms to see if they have similar applications. Techniques used in bioinformatics are interesting to me in particular, since they work on finding structural similarities across gene sequences. It’s not exactly the same thing, but it might be possible to repurpose those techniques to find certain similarities across codebases.

- Jessica

> On Aug 27, 2016, at 1:38 PM, Philip Reames <listmail at philipreames.com> wrote:
>
> Thanks for a well written and informative writeup. This was a pleasure to read and laid out the issues well for consideration.
>
> On a technical level, I could see this having a secondary use for developers of the compiler itself. Seeing the same pattern of assembly many times across a large binary is often a hint of a missed optimization. Using this type of technique to generate candidates for further investigation seems interesting.
>
> Philip
>
> [snip: quoted original RFC trimmed]
Andrey Bokhanko via llvm-dev
2016-Aug-29 19:20 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Hi Jessica,

Let me echo others' comments -- this is a very interesting research project and a nice write-up, indeed!

In order to transform it into a *practically* useful optimization pass, though, the "results" section should be strengthened a bit:

* You simply say "tests >= 4 Kb", without specifying the nature of these tests at all. Please tell us exactly which tests you used, along with the exact compiler options. If the tests are artificial, they might be good for debugging, but worthless as a measure of the usefulness of your pass.

* You use "a fairly recent 64-bit Intel processor" to test an embedded-targeting optimization, which is odd. If you want to target x86 (good choice! :-)), perhaps a Quark chip might be a better choice? You can measure code size impact without using real hardware -- just add the "-triple i586-intel-elfiamcu" option.

Yours,
Andrey

On Sat, Aug 27, 2016 at 12:26 AM, Jessica Paquette via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [snip: quoted original RFC trimmed]
Andrey Bokhanko via llvm-dev
2016-Aug-29 19:22 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
And one more thing: have you measured impact of your pass with MergeFunctions pass enabled? AFAIK, it's disabled by default (though I might be mistaken). Yours, Andrey On Mon, Aug 29, 2016 at 10:20 PM, Andrey Bokhanko <andreybokhanko at gmail.com> wrote:> Hi Jessica, > > Let me echo others' comments -- this is a very interesting research > project and a nice write-up, indeed! > > In order to transform it to a *practically* useful optimization pass, > though, "results" section should be strenghtened up a bit: > * You simply say "tests >= 4 Kb", without specifying the nature of these > tests at all. Please tell exactly, along with exact compiler options you > used. If the tests are artificial, they might be good for debugging, but > worthless as a measure of usefulness of your pass. > * You use "a fairly recent 64-bit Intel processor" to test an > embedded-targeting optimization, which is odd. If you want to target x86 > (good choice! :-)), perhaps a Quark chip might be a better choice? You can > measure code size impact without using a real hardware -- just add "-triple > i586-intel-elfiamcu" option. > > Yours, > Andrey > > > On Sat, Aug 27, 2016 at 12:26 AM, Jessica Paquette via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> Hi everyone, >> >> Since I haven't said anything on the mailing list before, a quick >> introduction. I'm an intern at Apple, and over the summer I implemented a >> prototype for an outlining pass for code size in LLVM. Now I'm seeking to >> eventually upstream it. I have the preliminary code on GitHub right now, >> but it's still very prototypical (see the code section). >> >> ===============================>> Motivation >> ===============================>> The goal of the internship was to create an interprocedural pass that >> would reduce code size as much as possible, perhaps at the cost of some >> performance. 
This would be useful to, say, embedded programmers who only >> have a few kilobytes to work with and a substantial amount of code to fit >> in that space. >> >> >> ===============================>> Approach and Initial Results >> ===============================>> To do this, we chose to implement an outliner. Outliners find sequences of >> instructions which would be better off as a function call, by some measure >> of "better". In this case, the measure of "better" is "makes code >> smaller". >> >> >> ===============================>> Results >> ===============================>> These results are from a fairly recent 64-bit Intel processor, using a >> version of Clang equipped with the outliner prototype versus an equivalent >> version of Clang without the outliner. >> >> CODE SIZE REDUCTION >> For tests >=4 Kb in non-outlined size, the outliner currently provides an >> average of 12.94% code size reduction on the LLVM test suite in comparison >> to a default Clang, up to 51.4% code size reduction. In comparison to a >> Clang with -Oz, the outliner provides an average of a 1.29% code size >> reduction, up to a 37% code size reduction. I believe that the -Oz numbers >> can be further improved by better tuning the outlining cost model. >> >> EXECUTION TIME IMPACT >> On average, the outliner increases execution time by 2% on the LLVM test >> suite, but has been also shown to improve exection time by up to 16%. >> These results were from a fairly recent Intel processor, so the results >> may vary. Recent Intel processors have very low latency for function >> calls, which may impact these results. Execution time improvements are >> likely dependent on the latency of function calls, instruction caching >> behaviour, and the execution frequency of the code being outlined. In >> partucular, using a processor with heavy function call latency will likely >> increase execution time overhead. 
>>
>>
>> ===============================
>> Implementation
>> ===============================
>> The outliner, in its current state, is a MachineModulePass. It finds
>> *identical* sequences of MIR, after register allocation, and pulls them
>> out into their own functions. Thus, it's effectively assembly-level.
>> Ultimately, the algorithm used is general, so it can sit anywhere, but MIR
>> was very convenient for the time being.
>>
>> It requires two data structures:
>>
>> 1. A generalized suffix tree
>> 2. A "terminated string"
>>
>> 1: The generalized suffix tree is constructed using Ukkonen's linear-time
>> construction algorithm [1]. Suffix trees require linear space and support
>> linear-time substring queries. In practice, constructing the suffix tree
>> is the most time-consuming part, but I haven't noticed a difference in
>> compilation time on, say, 12 MB .ll files.
>>
>> 2: To support the suffix tree, we need a "terminated string". This is a
>> generalized string with a unique terminator character appended to the
>> end. TerminatedStrings can be built from any type.
>>
>> The algorithm is then roughly as follows:
>>
>> 1. For each MachineBasicBlock in the program, build a TerminatedString for
>> that block.
>> 2. Build a suffix tree for that collection of strings.
>> 3. Query the suffix tree for the longest repeated substring and place that
>> string in a candidate list. Repeat until none are found.
>> 4. Create functions for each candidate.
>> 5. Replace each candidate with a call to its respective function.
>>
>> Currently, the program itself isn't stored in the suffix tree, but rather
>> a "proxy string" of integers. This isn't necessary at the MIR level, but
>> may be for an IR-level extension of the algorithm.
>>
>>
>> ===============================
>> Challenges
>> ===============================
>> 1) MEMORY CONSUMPTION
>> Given a string of length n, a naive suffix tree implementation can take up
>> to 40n bytes of memory.
>> However, this number can be reduced to 20n with a
>> bit of work [2]. Currently, the suffix tree stores the entire program,
>> including instructions which really ought not to be outlined, such as
>> returns. These instructions should not be included in the final
>> implementation, but should rather act as terminators for the strings. This
>> will likely curb memory consumption. Suffix trees have been used in the
>> past in sliding-window-based compression schemes, which may serve as a
>> source of inspiration for reducing memory overhead [3].
>>
>> Nonetheless, the outliner probably shouldn't be run unless it really has
>> to be run. It will likely be used mostly in embedded spaces, where the
>> programs have to fit into small devices anyway. Thus, memory overhead for
>> the compiler likely won't be a problem. The outliner should only be used
>> in -Oz compilations, and possibly should have its own flag.
>>
>>
>> 2) EXECUTION TIME
>> Currently, the outliner isn't tuned to prevent execution time increases.
>> There is an average of a 2% execution time hit on the tests in
>> the LLVM test suite, with a few outliers of up to 30%. The outliers are
>> tests which contain hot loops. The outliner really ought to be able to use
>> profiling information and not outline from hot areas. Another suggestion
>> people have given me is to add a "never outline" directive which would
>> allow the user to say something along the lines of "this is a hot loop,
>> please never outline from it".
>>
>> It's also important to note that these numbers are coming from a fairly
>> recent Intel processor.
>>
>>
>> 3) CONSTRAINTS ON INSTRUCTIONS
>> The outliner currently won't pull anything out of functions which use a
>> red zone. It also won't pull anything out that uses the stack, the
>> instruction pointer, constant pool indices, CFI indices, jump table
>> indices, or frame indices.
>> This removes many opportunities for outlining which would
>> likely be available at a higher level (such as IR). Thus, there's a case
>> for moving this up to a higher level.
>>
>>
>> ===============================
>> Additional Applications
>> ===============================
>> The suffix tree itself could be used as a tool for finding opportunities
>> to refactor code. For example, it could recognize places where the user
>> likely copied and pasted some code. This could be run on codebases to find
>> areas where people could manually outline things at the source level.
>>
>> Using the terminated string class, it would also be possible to implement
>> other string algorithms on code. This may open the door to new ways to
>> analyze existing codebases.
>>
>>
>> ===============================
>> Roadmap
>> ===============================
>> The current outliner is *very* prototypical. The version I would want to
>> upstream would be a new implementation. Here's what I'd like to address
>> and accomplish:
>>
>> 1. Ask "what does the LLVM community want from an outliner?" and use that
>> to drive development of the algorithm.
>> 2. Reimplement the outliner, perhaps using a less memory-intensive data
>> structure like a suffix array.
>> 3. Begin adding features to the algorithm, for example:
>>    i. Teaching the algorithm about hot/cold blocks of code and taking
>> that into account.
>>    ii. Simple parameter passing.
>>    iii. Similar function outlining -- e.g., noticing that two outlining
>> candidates are similar and can be merged into one function with some
>> control flow.
>>
>>
>> ===============================
>> Code
>> ===============================
>> Note: This code requires MachineModulePasses.
>>
>> * Main pass:
>> https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
>>
>> * Suffix tree:
>> https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
>>
>> * TerminatedString and TerminatedStringList:
>> https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h
>>
>> Here are a couple of unit tests for the data structures.
>>
>> * Suffix tree unit tests:
>> https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp
>>
>> * TerminatedString unit tests:
>> https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp
>>
>> * TerminatedStringList unit tests:
>> https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp
>>
>>
>> ===============================
>> References
>> ===============================
>> [1] Ukkonen's Algorithm:
>> https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
>> [2] Suffix Trees and Suffix Arrays:
>> http://web.cs.iastate.edu/~cs548/suffix.pdf
>> [3] Extended Application of Suffix Trees to Data Compression:
>> http://www.larsson.dogma.net/dccpaper.pdf
>>
>>
>> Thanks for reading,
>> Jessica
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
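[Editorial note: the candidate search described in the Implementation section above can be sketched in miniature. This toy version works on integer "proxy strings", joins blocks with unique negative terminators (the role of TerminatedString), and substitutes a naive quadratic search for the suffix tree's linear-time longest-repeated-substring query; all names are illustrative, not the prototype's.]

```python
def longest_repeated_substring(s, min_len=2):
    """Naive stand-in for the suffix-tree query (step 3): return the
    longest subsequence of s that occurs at least twice. This is
    O(n^3); a generalized suffix tree answers it in linear time."""
    n = len(s)
    for length in range(n - 1, min_len - 1, -1):
        seen = set()
        for i in range(n - length + 1):
            key = tuple(s[i:i + length])
            if key in seen:
                return list(key)
            seen.add(key)
    return None

def find_candidates(blocks):
    """Steps 1-3: concatenate per-block proxy strings with unique
    terminators so no match crosses a block boundary, then repeatedly
    pull out the longest repeated sequence."""
    program, term = [], -1
    for block in blocks:
        program += block + [term]
        term -= 1
    candidates = []
    while True:
        cand = longest_repeated_substring(program)
        if cand is None:
            return candidates  # steps 4-5 (function creation) omitted
        candidates.append(cand)
        # Overwrite each occurrence with fresh terminators so the
        # next query finds a different candidate.
        i = 0
        while i <= len(program) - len(cand):
            if program[i:i + len(cand)] == cand:
                for j in range(i, i + len(cand)):
                    program[j] = term
                    term -= 1
                i += len(cand)
            else:
                i += 1

# Opcode ids for three basic blocks; [1, 2, 3, 4] repeats, as does [7, 8].
print(find_candidates([[1, 2, 3, 4], [9, 1, 2, 3, 4], [7, 8, 7, 8]]))
# -> [[1, 2, 3, 4], [7, 8]]
```

The unique terminators are what make the structure "generalized": they guarantee no repeated substring spans two basic blocks, which would be meaningless to outline.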
Jessica Paquette via llvm-dev
2016-Aug-30 06:05 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Thanks a bunch! I'll try to clarify a bit about why I did what I did.

> In order to transform it to a *practically* useful optimization pass,
> though, the "results" section should be strengthened up a bit:
> * You simply say "tests >= 4 Kb", without specifying the nature of these
> tests at all. Please tell exactly, along with the exact compiler options you
> used. If the tests are artificial, they might be good for debugging, but
> worthless as a measure of usefulness of your pass.

Definitely agree! The tests that I used were the C/C++ programs in the LLVM test suite, with externals enabled. What I did was run clang -mno-red-zone against clang -mno-red-zone with outlining, and clang -mno-red-zone -Oz against clang -mno-red-zone with outlining.

What would be *better* would be to run it on some real embedded cases versus the programs in the test suite. What I found in the test suite is that there are a few programs which do *really* well (like, say, an 80% reduction!), but it turns out that those programs are typically entirely macros. (Some other interesting test cases, less relevant to embedded, might be programs with heavy template usage.)

> * You use "a fairly recent 64-bit Intel processor" to test an
> embedded-targeting optimization, which is odd. If you want to target x86
> (good choice! :-)), perhaps a Quark chip might be a better choice? You can
> measure code size impact without using real hardware -- just add the "-triple
> i586-intel-elfiamcu" option.

The reason I started out using an Intel processor was just to verify that I could even *run* outlined programs in the first place. I have been working on extending it to ARM, which would be far more interesting for embedded than x86-64. That being said, this *can* potentially be useful on non-embedded computers, but the benefits would be much greater in an embedded space. (I also didn't know about that flag, so thanks!)
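[Editorial note: for anyone reproducing these comparisons, the per-test reduction is just the relative difference in section sizes between the two builds. The thread doesn't say how the averages were aggregated, so the numbers and the plain arithmetic mean below are purely illustrative.]

```python
def size_reduction(base_bytes, outlined_bytes):
    """Percent code-size reduction of the outlined build vs. the base
    build, e.g. computed from the .text sizes of the two binaries."""
    return 100.0 * (base_bytes - outlined_bytes) / base_bytes

# Hypothetical .text sizes in bytes for three tests: (base, outlined).
pairs = [(8192, 7000), (4096, 3900), (16384, 12000)]
reductions = [size_reduction(base, out) for base, out in pairs]
print([round(r, 2) for r in reductions])            # per-test reductions
print(round(sum(reductions) / len(reductions), 2))  # arithmetic mean
```

Note that an arithmetic mean of percentages weights small and large tests equally; summing the sizes first, or using a geometric mean, would give different headline numbers.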
- Jessica

On Mon, Aug 29, 2016 at 12:20 PM, Andrey Bokhanko <andreybokhanko at gmail.com> wrote:
> [quoted text snipped]
Alex Bradbury via llvm-dev
2016-Aug-30 07:44 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
On 26 August 2016 at 22:26, Jessica Paquette via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> CODE SIZE REDUCTION
> For tests >= 4 Kb in non-outlined size, the outliner currently provides an
> average of 12.94% code size reduction on the LLVM test suite in comparison
> to a default Clang, up to 51.4% code size reduction. In comparison to a
> Clang with -Oz, the outliner provides an average of a 1.29% code size
> reduction, up to a 37% code size reduction. I believe that the -Oz numbers
> can be further improved by better tuning the outlining cost model.

Hi Jessica, as I said on Twitter, many thanks for sharing such a detailed write-up.

One thing I was wondering is whether you'd looked at your outlining scheme with reference to function prologues and epilogues. There has been a patch proposed to use shared epilogues in compiler-rt for AArch64 in order to reduce code size (https://reviews.llvm.org/D15600), and work on the RISC-V compressed instruction set has shown that outlining for prologues/epilogues can reduce code size by ~5% and weaken the argument for supporting load-multiple and store-multiple instructions (https://riscv.org/wp-content/uploads/2015/06/riscv-compressed-workshop-june2015.pdf, slide 15). Having the compiler automatically detect such opportunities seems attractive.
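[Editorial note: to see the shape of the opportunity Alex describes, one can look for the longest instruction suffix shared by several functions -- a candidate shared epilogue that the other functions could jump to instead of duplicating it. A toy sketch over made-up AArch64-flavoured mnemonic strings; the function contents are illustrative only, not real MIR.]

```python
def shared_epilogue(functions, min_count=2):
    """Return the longest instruction suffix shared by at least
    min_count of the given functions -- a candidate for a single
    outlined epilogue."""
    best, length = [], 1
    while True:
        tails = {}
        for f in functions:
            if len(f) >= length:
                key = tuple(f[-length:])
                tails[key] = tails.get(key, 0) + 1
        shared = [k for k, c in tails.items() if c >= min_count]
        if not shared:
            return best
        # Keep the most common suffix at this length, then try longer.
        best = list(max(shared, key=lambda k: tails[k]))
        length += 1

funcs = [
    ["add", "mul", "ldp x19, x20", "ldp x29, x30", "ret"],
    ["sub", "ldp x19, x20", "ldp x29, x30", "ret"],
    ["mov", "str", "ret"],
]
print(shared_epilogue(funcs))
# -> ['ldp x19, x20', 'ldp x29, x30', 'ret']
```

A prologue version would compare prefixes instead of suffixes; the suffix-tree machinery in the proposed pass can find both as special cases of repeated sequences.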
Do you think that this subset of outlining would be better addressed by a specifically targeted pass, or within your MachineOutliner?

Best,

Alex
Violeta Vukobrat via llvm-dev
2016-Sep-28 18:22 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Hi Jessica,

I am looking into trying out your patch and reproducing the results that you got for code size reduction. Could you tell me what compiler options you used? Also, which clang commit did you work with?

Violeta

--
Violeta Vukobrat
Software Engineer
RT-RK Computer Based Systems LLC
www.rt-rk.com
Jessica Paquette via llvm-dev
2016-Sep-29 07:38 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Hi Violeta,

I compiled with clang -Oz and clang -Oz -mno-red-zone for comparisons against -Oz, and clang -O0 and clang -O0 -mno-red-zone for comparisons against a default clang.

I unfortunately don't have the clang commit I worked with on my home laptop, and don't have access to the computer I was using at Apple, so I can't help you there.

Jessica

> On Sep 28, 2016, at 11:22 AM, Violeta Vukobrat <violeta.vukobrat at rt-rk.com> wrote:
> [quoted text snipped]