Kevin Choi via llvm-dev
2016-Aug-26  21:55 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
I think the "Motivation" section explained that. I too first wondered "why not
at IR?", but the reason looks to be that MIR, post-RA, has the most accurate
heuristics (the best way to know looks like actually getting there).

Do you know if there is any experimental pass that relies on deriving
heuristics from a feedback loop, e.g. by letting a duplicate
module/function/block continue on through the rest of the pipeline?

Regards,
Kevin

On 26 August 2016 at 14:36, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Hi Jessica,
>
> This is quite interesting.
>
> Can you comment on why you started by doing this at the MI level, as
> opposed to the IR level? And at the MI level, why after RA instead of
> before RA?
>
> Perhaps the first question is another way of asking about how accurately
> we can model call-site code size at the IR level?
>
> Thanks in advance,
> Hal
>
> ----- Original Message -----
> > From: "Jessica Paquette via llvm-dev" <llvm-dev at lists.llvm.org>
> > To: llvm-dev at lists.llvm.org
> > Sent: Friday, August 26, 2016 4:26:09 PM
> > Subject: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
> >
> > Hi everyone,
> >
> > Since I haven't said anything on the mailing list before, a quick
> > introduction. I'm an intern at Apple, and over the summer I implemented a
> > prototype for an outlining pass for code size in LLVM. Now I'm seeking to
> > eventually upstream it. I have the preliminary code on GitHub right now,
> > but it's still very prototypical (see the code section).
> >
> > ===============================
> > Motivation
> > ===============================
> > The goal of the internship was to create an interprocedural pass that
> > would reduce code size as much as possible, perhaps at the cost of some
> > performance. This would be useful to, say, embedded programmers who only
> > have a few kilobytes to work with and a substantial amount of code to fit
> > in that space.
> >
> >
> > ===============================
> > Approach and Initial Results
> > ===============================
> > To do this, we chose to implement an outliner. Outliners find sequences of
> > instructions which would be better off as a function call, by some measure
> > of "better". In this case, the measure of "better" is "makes code smaller".
> >
> >
> > ===============================
> > Results
> > ===============================
> > These results are from a fairly recent 64-bit Intel processor, using a
> > version of Clang equipped with the outliner prototype versus an equivalent
> > version of Clang without the outliner.
> >
> > CODE SIZE REDUCTION
> > For tests >=4 Kb in non-outlined size, the outliner currently provides an
> > average of 12.94% code size reduction on the LLVM test suite in comparison
> > to a default Clang, up to 51.4% code size reduction. In comparison to a
> > Clang with -Oz, the outliner provides an average of a 1.29% code size
> > reduction, up to a 37% code size reduction. I believe that the -Oz numbers
> > can be further improved by better tuning the outlining cost model.
> >
> > EXECUTION TIME IMPACT
> > On average, the outliner increases execution time by 2% on the LLVM test
> > suite, but has also been shown to improve execution time by up to 16%.
> > These results were from a fairly recent Intel processor, so the results
> > may vary. Recent Intel processors have very low latency for function
> > calls, which may impact these results. Execution time improvements are
> > likely dependent on the latency of function calls, instruction caching
> > behaviour, and the execution frequency of the code being outlined. In
> > particular, using a processor with heavy function call latency will likely
> > increase execution time overhead.
> >
> >
> > ===============================
> > Implementation
> > ===============================
> > The outliner, in its current state, is a MachineModulePass. It finds
> > *identical* sequences of MIR, after register allocation, and pulls them
> > out into their own functions. Thus, it's effectively assembly-level.
> > Ultimately, the algorithm used is general, so it can sit anywhere, but MIR
> > was very convenient for the time being.
> >
> > It requires two data structures.
> >
> > 1. A generalized suffix tree
> > 2. A "terminated string"
> >
> > 1: The generalized suffix tree is constructed using Ukkonen's linear-time
> > construction algorithm [1]. Suffix trees require linear space and support
> > linear-time substring queries. In practice, the construction time for the
> > suffix tree is the most time-consuming part, but I haven't noticed a
> > difference in compilation time on, say, 12 MB .ll files.
> >
> > 2: To support the suffix tree, we need a "terminated string." This is a
> > generalized string with a unique terminator character appended to the
> > end. TerminatedStrings can be built from any type.
> >
> > The algorithm is then roughly as follows.
> >
> > 1. For each MachineBasicBlock in the program, build a TerminatedString for
> >    that block.
> > 2. Build a suffix tree for that collection of strings.
> > 3. Query the suffix tree for the longest repeated substring and place that
> >    string in a candidate list. Repeat until none are found.
> > 4. Create functions for each candidate.
> > 5. Replace each candidate with a call to its respective function.
> >
> > Currently, the program itself isn't stored in the suffix tree, but rather
> > a "proxy string" of integers. This isn't necessary at the MIR level, but
> > may be for an IR level extension of the algorithm.
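
The "proxy string" and "longest repeated substring" steps described in the Implementation section can be illustrated with a small Python sketch. This is not the prototype's code: the function names are hypothetical, instructions are modelled as plain strings, and a sorted list of suffixes stands in for the generalized suffix tree built with Ukkonen's algorithm, so it is far from linear time. Each block gets its own negative terminator, which plays the role of the unique terminator in a TerminatedString and keeps matches from spanning block boundaries.

```python
from typing import Dict, List, Sequence, Tuple


def build_proxy_string(blocks: Sequence[Sequence[str]]) -> List[int]:
    """Map each distinct instruction to a non-negative integer and append a
    unique negative terminator after every block."""
    ids: Dict[str, int] = {}
    proxy: List[int] = []
    for block_index, block in enumerate(blocks):
        for instr in block:
            # Identical instruction text always gets the same integer.
            proxy.append(ids.setdefault(instr, len(ids)))
        # A terminator occurs exactly once, so it can never be part of a
        # repeated substring.
        proxy.append(-(block_index + 1))
    return proxy


def longest_repeated_substring(s: Sequence[int]) -> Tuple[int, ...]:
    """Return the longest substring that occurs at least twice.

    Sorts all suffixes and compares neighbours (simple, not linear time).
    Occurrences may overlap; a real outliner would have to reject those.
    """
    suffixes = sorted(range(len(s)), key=lambda i: list(s[i:]))
    best: Tuple[int, ...] = ()
    for a, b in zip(suffixes, suffixes[1:]):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        if n > len(best):
            best = tuple(s[a:a + n])
    return best


# Two hypothetical basic blocks sharing a three-instruction sequence.
blocks = [
    ["mov rax, rbx", "add rax, 1", "shl rax, 2"],
    ["xor rcx, rcx", "mov rax, rbx", "add rax, 1", "shl rax, 2"],
]
proxy = build_proxy_string(blocks)
print(proxy)                              # [0, 1, 2, -1, 3, 0, 1, 2, -2]
print(longest_repeated_substring(proxy))  # (0, 1, 2)
```

Step 3 of the algorithm repeats the query until no candidate remains; the sketch performs a single query only.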
> >
> >
> > ===============================
> > Challenges
> > ===============================
> > 1) MEMORY CONSUMPTION
> > Given a string of length n, a naive suffix tree implementation can take up
> > to 40n bytes of memory. However, this number can be reduced to 20n with a
> > bit of work [2]. Currently, the suffix tree stores the entire program,
> > including instructions which really ought not to be outlined, such as
> > returns. These instructions should not be included in the final
> > implementation, but should rather act as terminators for the strings. This
> > will likely curb memory consumption. Suffix trees have been used in the
> > past in sliding-window-based compression schemes, which may serve as a
> > source of inspiration for reducing memory overhead. [3]
> >
> > Nonetheless, the outliner probably shouldn't be run unless it really has
> > to be run. It will likely be used mostly in embedded spaces, where the
> > programs have to fit into small devices anyway. Thus, memory overhead for
> > the compiler likely won't be a problem. The outliner should only be used
> > in -Oz compilations, and possibly should have its own flag.
> >
> >
> > 2) EXECUTION TIME
> > Currently, the outliner isn't tuned for preventing execution time
> > increases. There is an average of a 2% execution time hit on the tests in
> > the LLVM test suite, with a few outliers of up to 30%. The outliers are
> > tests which contain hot loops. The outliner really ought to be able to use
> > profiling information and not outline from hot areas. Another suggestion
> > people have given me is to add a "never outline" directive which would
> > allow the user to say something along the lines of "this is a hot loop,
> > please never outline from it".
> >
> > It's also important to note that these numbers are coming from a fairly
> > recent Intel processor.
> >
> >
> > 3) CONSTRAINTS ON INSTRUCTIONS
> > The outliner currently won't pull anything out of functions which use a
> > red zone. It also won't pull out anything that uses the stack, the
> > instruction pointer, constant pool indices, CFI indices, jump table
> > indices, or frame indices. This removes many opportunities for outlining
> > which would likely be available at a higher level (such as IR). Thus,
> > there's a case for moving this up to a higher level.
> >
> >
> > ===============================
> > Additional Applications
> > ===============================
> > The suffix tree itself could be used as a tool for finding opportunities
> > to refactor code. For example, it could recognize places where the user
> > likely copied and pasted some code. This could be run on codebases to find
> > areas where people could manually outline things at the source level.
> >
> > Using the terminated string class, it would also be possible to implement
> > other string algorithms on code. This may open the door to new ways to
> > analyze existing codebases.
> >
> >
> > ===============================
> > Roadmap
> > ===============================
> > The current outliner is *very* prototypical. The version I would want to
> > upstream would be a new implementation. Here's what I'd like to address
> > and accomplish.
> >
> > 1. Ask "what does the LLVM community want from an outliner" and use that
> >    to drive development of the algorithm.
> > 2. Reimplement the outliner, perhaps using a less memory-intensive data
> >    structure like a suffix array.
> > 3. Begin adding features to the algorithm, for example:
> >    i. Teaching the algorithm about hot/cold blocks of code and taking
> >       that into account.
> >    ii. Simple parameter passing.
> >    iii. Similar function outlining, e.g., noticing that two outlining
> >       candidates are similar and can be merged into one function with
> >       some control flow.
> >
> >
> > ===============================
> > Code
> > ===============================
> > Note: This code requires MachineModulePasses
> >
> > * Main pass:
> > https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
> >
> > * Suffix tree:
> > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
> >
> > * TerminatedString and TerminatedStringList:
> > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h
> >
> > Here are a couple unit tests for the data structures.
> >
> > * Suffix tree unit tests:
> > https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp
> >
> > * TerminatedString unit tests:
> > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp
> >
> > * TerminatedStringList unit tests:
> > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp
> >
> >
> > ===============================
> > References
> > ===============================
> > [1] Ukkonen's Algorithm:
> > https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
> > [2] Suffix Trees and Suffix Arrays:
> > http://web.cs.iastate.edu/~cs548/suffix.pdf
> > [3] Extended Application of Suffix Trees to Data Compression:
> > http://www.larsson.dogma.net/dccpaper.pdf
> >
> >
> > Thanks for reading,
> > Jessica
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
Hal Finkel via llvm-dev
2016-Aug-26  22:01 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
----- Original Message -----
> From: "Kevin Choi" <code.kchoi at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, August 26, 2016 4:55:29 PM
> Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass

> I think the "Motivation" section explained that.

I don't think it explained it.

> I too first thought about "why not at IR?" but the reason looks like
> MIR, post-RA has the most accurate heuristics (best way to know
> looks like actually getting there).

But also, potentially, the fewest opportunities. That's why I'm curious about
the motivation - the trade-offs are not obvious to me.

 -Hal

> Do you know if there is any experimental pass that relies on deriving
> heuristics by a feedback loop after letting, ie. a duplicate
> module/function/block continue past?

> Regards,
> Kevin

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Quentin Colombet via llvm-dev
2016-Aug-26  23:54 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Hi,
I'll let Jessica give more details, but here are some insights.
MIR offers a straightforward way to model benefits, because we know which
instructions we remove and which ones we add, and there is no overhead for
setting up parameters. Indeed, since the coloring will be the same between the
different outlining candidates, the call is just a jump somewhere else. We do
not have to worry about the impact of parameter passing and ABI.
So basically, a better cost model. That's one part of the story.
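
To make the cost-model point concrete, here is a rough Python sketch of the arithmetic that becomes possible once exact instruction sizes are known. It is an illustration under stated assumptions, not the prototype's cost model (which this thread does not describe): the function name and parameters are hypothetical, and the byte sizes in the example are those of an x86-64 near call (5 bytes) and `ret` (1 byte).

```python
def outlining_benefit(seq_bytes: int, occurrences: int,
                      call_bytes: int, return_bytes: int) -> int:
    """Bytes saved by outlining one repeated sequence.

    A negative result means outlining would grow the code.
    """
    # Before: every occurrence carries its own copy of the sequence.
    before = seq_bytes * occurrences
    # After: one call per occurrence, plus a single outlined copy that ends
    # with a return. No parameter setup is modelled, matching the post-RA
    # assumption that candidates already use identical registers.
    after = occurrences * call_bytes + seq_bytes + return_bytes
    return before - after


# A 20-byte sequence repeated 3 times: 60 bytes before, 36 after.
print(outlining_benefit(20, 3, call_bytes=5, return_bytes=1))  # 24
# A 6-byte sequence repeated twice is not worth outlining.
print(outlining_benefit(6, 2, call_bytes=5, return_bytes=1))   # -5
```

At the IR level the same formula would need estimates for `seq_bytes` and for any argument-passing code, which is the accuracy problem Hal's question points at.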
The other part is that at the LLVM IR level, or before register allocation,
identifying similar code sequences is much harder, at least with a
suffix-tree-like algorithm. Basically, the problem is how we name our
instructions such that we can match them.
Let me take an example.
foo() {
  /* bunch of code */
  a = add b, c;
  d = add e, f;
}
bar() {
  d = add e, g;
  f = add c, w;
}
With proper renaming, we can outline both adds in one function. The difficulty
is recognizing that they are semantically equivalent, so that we can give them
the same identifier in the suffix tree. I won't get into the details, but it
gets tricky quickly. We were thinking of reusing GVN to get such identifiers if
we wanted to do the outlining at the IR level, but solving this problem is hard.
By running after regalloc, we basically have a heuristic that does this naming
for us.
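
As a toy illustration of the renaming problem, the Python sketch below canonicalizes the two `add` pairs from the example by numbering values in order of first appearance. This is only an assumption about what "proper renaming" could look like for a fixed window of instructions; it is not GVN, it ignores types, commutativity, and values live into or out of the window, and the names `Instr` and `canonicalize` are hypothetical.

```python
from typing import Dict, List, Tuple

# (result, opcode, operands), e.g. ("a", "add", ("b", "c")) for "a = add b, c".
Instr = Tuple[str, str, Tuple[str, ...]]


def canonicalize(seq: List[Instr]) -> List[Instr]:
    """Rename values to v0, v1, ... in order of first appearance."""
    names: Dict[str, str] = {}

    def rename(value: str) -> str:
        # The first time a value is seen it gets the next free name.
        return names.setdefault(value, f"v{len(names)}")

    out: List[Instr] = []
    for result, opcode, operands in seq:
        new_result = rename(result)
        new_operands = tuple(rename(op) for op in operands)
        out.append((new_result, opcode, new_operands))
    return out


foo = [("a", "add", ("b", "c")), ("d", "add", ("e", "f"))]
bar = [("d", "add", ("e", "g")), ("f", "add", ("c", "w"))]

print(foo == bar)                              # False: raw names differ
print(canonicalize(foo) == canonicalize(bar))  # True: same shape
print(canonicalize(foo))
```

The canonical name of a value depends on where the window starts, so no single per-instruction identifier can be computed up front and fed to a suffix tree. After register allocation, the physical register names already serve as that identifier.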
Cheers,
-Quentin 
> On Aug 26, 2016, at 3:01 PM, Hal Finkel via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> 
> From: "Kevin Choi" <code.kchoi at gmail.com
<mailto:code.kchoi at gmail.com>>
> To: "Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at
anl.gov>>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org <mailto:llvm-dev
at lists.llvm.org>>
> Sent: Friday, August 26, 2016 4:55:29 PM
> Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
> 
> I think the "Motivation" section explained that.
> I don't think it explained it.
> I too first thought about "why not at IR?" but the reason looks
like MIR, post-RA has the most accurate heuristics (best way to know looks like
actually getting there).
> But also, potentially, the fewest opportunities. That's why I'm
curious about the motivation - the trade offs are not obvious to me.
> 
>  -Hal
> 
> 
> Do you know if there is any experimental pass that relies on deriving
heuristics by a feedback loop after letting, ie. a duplicate
module/function/block continue past?
> 
> Regards,
> Kevin
> 
> On 26 August 2016 at 14:36, Hal Finkel via llvm-dev <llvm-dev at
lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> Hi Jessica,
> 
> This is quite interesting.
> 
> Can you comment on why you started by doing this at the MI level, as
opposed to the IR level. And at the MI level, why after RA instead of before RA?
> 
> Perhaps the first question is another way of asking about how accurately we
can model call-site code size at the IR level?
> 
> Thanks in advance,
> Hal
> 
> 
> > From: "Jessica Paquette via llvm-dev" <llvm-dev at
lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
> > To: llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
> > Sent: Friday, August 26, 2016 4:26:09 PM
> > Subject: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
> >
> > Hi everyone,
> >
> > Since I haven't said anything on the mailing list before, a quick
> > introduction. I'm an intern at Apple, and over the summer I
> > implemented a
> > prototype for an outlining pass for code size in LLVM. Now I'm
> > seeking to
> > eventually upstream it. I have the preliminary code on GitHub right
> > now,
> > but it's still very prototypical (see the code section).
> >
> > ===============================> > Motivation
> > ===============================> > The goal of the internship
was to create an interprocedural pass that
> > would reduce code size as much as possible, perhaps at the cost of
> > some
> > performance. This would be useful to, say, embedded programmers who
> > only
> > have a few kilobytes to work with and a substantial amount of code to
> > fit
> > in that space.
> >
> >
> > ===============================> > Approach and Initial Results
> > ===============================> > To do this, we chose to
implement an outliner. Outliners find
> > sequences of
> > instructions which would be better off as a function call, by some
> > measure
> > of "better". In this case, the measure of "better"
is "makes code
> > smaller".
> >
> >
> > ===============================> > Results
> > ===============================> > These results are from a
fairly recent 64-bit Intel processor, using
> > a
> > version of Clang equipped with the outliner prototype versus an
> > equivalent
> > version of Clang without the outliner.
> >
> > CODE SIZE REDUCTION
> > For tests >=4 Kb in non-outlined size, the outliner currently
> > provides an
> > average of 12.94% code size reduction on the LLVM test suite in
> > comparison
> > to a default Clang, up to 51.4% code size reduction. In comparison to
> > a
> > Clang with -Oz, the outliner provides an average of a 1.29% code size
> > reduction, up to a 37% code size reduction. I believe that the -Oz
> > numbers
> > can be further improved by better tuning the outlining cost model.
> >
> > EXECUTION TIME IMPACT
> > On average, the outliner increases execution time by 2% on the LLVM
> > test
> > suite, but has been also shown to improve exection time by up to 16%.
> > These results were from a fairly recent Intel processor, so the
> > results
> > may vary. Recent Intel processors have very low latency for function
> > calls, which may impact these results. Execution time improvements
> > are
> > likely dependent on the latency of function calls, instruction
> > caching
> > behaviour, and the execution frequency of the code being outlined. In
> > partucular, using a processor with heavy function call latency will
> > likely
> > increase execution time overhead.
> >
> >
> > ===============================> > Implementation
> > ===============================> > The outliner, in its current
state, is a MachineModulePass. It finds
> > *identical* sequences of MIR, after register allocation, and pulls
> > them
> > out into their own functions. Thus, it's effectively
assembly-level.
> > Ultimately, the algorithm used is general, so it can sit anywhere,
> > but MIR
> > was very convenient for the time being.
> >
> > It requires two data structures.
> >
> > 1. A generalized suffix tree
> > 2. A "terminated string"
> >
> > 1: The generalized suffix tree is constructed using Ukkonen's
> > linear-time construction algorithm [1]. Suffix trees require linear
> > space and support substring queries in time linear in the length of
> > the query. In practice, constructing the suffix tree is the most
> > time-consuming part, but I haven't noticed a difference in
> > compilation time on, say, 12 MB .ll files.
> >
> > 2: To support the suffix tree, we need a "terminated string." This is
> > a generalized string with a unique terminator character appended to
> > the end. TerminatedStrings can be built from any type.
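> >
> > As a minimal sketch of the idea (not the prototype's TerminatedString
> > class), a terminated string can be modelled as any sequence plus a
> > sentinel that compares equal only to itself. The names Terminator and
> > terminated are hypothetical and used only for illustration.
> >

```python
class Terminator:
    """Sentinel appended to a sequence; equal only to itself (identity)."""

    def __init__(self, index):
        self.index = index  # which string this terminator closes

    def __repr__(self):
        return f"<end {self.index}>"


def terminated(items, index):
    """Return the items as a list with a unique terminator at the end."""
    return list(items) + [Terminator(index)]


a = terminated("abc", 0)
b = terminated("abc", 1)
```

> > Here a and b have equal contents, but their last elements differ, so
> > no match can run past the end of either string.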
> >
> > The algorithm is then roughly as follows.
> >
> > 1. For each MachineBasicBlock in the program, build a
> > TerminatedString for
> > that block.
> > 2. Build a suffix tree for that collection of strings.
> > 3. Query the suffix tree for the longest repeated substring and place
> > that
> > string in a candidate list. Repeat until none are found.
> > 4. Create functions for each candidate.
> > 5. Replace each candidate with a call to its respective function.
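> >
> > Steps 1 to 3 can be sketched as follows. This is an illustration
> > only: it swaps the suffix tree for a naive sort of all suffixes
> > (quadratic or worse, not linear time), and it assumes instructions
> > have already been mapped to non-negative integers. The function name
> > is hypothetical.
> >

```python
def longest_repeated_substring(blocks):
    """Return the longest sequence that occurs at least twice in `blocks`.

    `blocks` is a list of lists of non-negative integers, one per block.
    """
    # Step 1: concatenate the blocks, ending each with a unique terminator.
    # Terminators are distinct negative integers, so they never equal an
    # instruction or each other, and no repeat can cross a block boundary.
    text = []
    for i, block in enumerate(blocks):
        text.extend(block)
        text.append(-(i + 1))

    # Step 2 (stand-in for the suffix tree): sort all suffix start offsets.
    order = sorted(range(len(text)), key=lambda start: text[start:])

    # Step 3: the longest repeated substring is the longest common prefix
    # of two suffixes that are adjacent in sorted order.
    best = []
    for a, b in zip(order, order[1:]):
        n = 0
        while (a + n < len(text) and b + n < len(text)
               and text[a + n] == text[b + n]):
            n += 1
        if n > len(best):
            best = text[a:a + n]
    return best


candidate = longest_repeated_substring([[1, 2, 3, 4], [9, 2, 3, 4, 7], [5, 6]])
```

> > With the input above, the only sequence of length three that appears
> > twice is 2, 3, 4. Note that this sketch, unlike a real outliner, does
> > not reject overlapping occurrences.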
> >
> > Currently, the program itself isn't stored in the suffix tree, but
> > rather
> > a "proxy string" of integers. This isn't necessary at the MIR level,
> > but
> > may be for an IR level extension of the algorithm.
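> >
> > One way such a proxy string could be built is to give every distinct
> > instruction its own integer. The sketch below assumes instructions
> > can be compared by their printed text, which is a simplification; the
> > function name and the sample instructions are hypothetical.
> >

```python
def build_proxy_strings(blocks):
    """Map each distinct instruction to a small integer, block by block."""
    ids = {}       # instruction text -> integer id
    proxies = []   # one list of integers per block
    for block in blocks:
        proxy = []
        for inst in block:
            # Identical instructions share an id; new ones get the next id.
            proxy.append(ids.setdefault(inst, len(ids)))
        proxies.append(proxy)
    return proxies, ids


proxies, ids = build_proxy_strings([
    ["mov rax, rbx", "add rax, 1", "ret"],
    ["add rax, 1", "ret"],
])
```

> > Both blocks end in the same two integers, so the repeated tail shows
> > up as a repeated substring of the proxy strings.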
> >
> >
> > ===============================
> > Challenges
> > ===============================
> > 1) MEMORY CONSUMPTION
> > Given a string of length n, a naive suffix tree implementation can
> > take up
> > to 40n bytes of memory. However, this number can be reduced to 20n
> > with a
> > bit of work [2]. Currently, the suffix tree stores the entire
> > program,
> > including instructions which really ought not to be outlined, such as
> > returns. These instructions should not be included in the final
> > implementation, but should rather act as terminators for the strings.
> > This
> > will likely curb memory consumption. Suffix trees have been used in
> > the
> > past in sliding-window-based compression schemes, which may serve as
> > a
> > source of inspiration for reducing memory overhead.[3]
> >
> > Nonetheless, the outliner probably shouldn't be run unless it really
> > has
> > to be run. It will likely be used mostly in embedded spaces, where
> > the
> > programs have to fit into small devices anyway. Thus, memory overhead
> > for
> > the compiler likely won't be a problem. The outliner should only be
> > used
> > in -Oz compilations, and possibly should have its own flag.
> >
> >
> > 2) EXECUTION TIME
> > Currently, the outliner isn't tuned for preventing execution time
> > increases. There is an average of a 2% execution time hit on the
> > tests in
> > the LLVM test suite, with a few outliers of up to 30%. The outliers
> > are
> > tests which contain hot loops. The outliner really ought to be able
> > to use
> > profiling information and not outline from hot areas. Another
> > suggestion
> > people have given me is to add a "never outline" directive which
> > would
> > allow the user to say something along the lines of "this is a hot
> > loop,
> > please never outline from it".
> >
> > It's also important to note that these numbers are coming from a
> > fairly
> > recent Intel processor.
> >
> >
> > 3) CONSTRAINTS ON INSTRUCTIONS
> > The outliner currently won't pull anything out of functions which use
> > a red zone. It also won't pull out anything that uses the stack or
> > the instruction pointer, or that uses constant pool indices, CFI
> > indices, jump table indices, or frame indices. This removes many
> > opportunities for outlining which would likely be available at a
> > higher level (such as IR). Thus, there's a case for moving this up to
> > a higher level.
> >
> >
> > ===============================
> > Additional Applications
> > ===============================
> > The suffix tree itself could be used as a tool for finding
> > opportunities
> > to refactor code. For example, it could recognize places where the
> > user
> > likely copied and pasted some code. This could be run on codebases to
> > find
> > areas where people could manually outline things at the source level.
> >
> > Using the terminated string class, it would also be possible to
> > implement
> > other string algorithms on code. This may open the door to new ways
> > to
> > analyze existing codebases.
> >
> >
> > ===============================
> > Roadmap
> > ===============================
> > The current outliner is *very* prototypical. The version I would want
> > to upstream would be a new implementation. Here's what I'd like to
> > address and accomplish.
> >
> > 1. Ask "what does the LLVM community want from an outliner" and use
> > that to drive development of the algorithm.
> > 2. Reimplement the outliner, perhaps using a less memory-intensive
> > data structure like a suffix array.
> > 3. Begin adding features to the algorithm, for example:
> >     i.   Teaching the algorithm about hot/cold blocks of code and
> >          taking that into account.
> >     ii.  Simple parameter passing.
> >     iii. Similar function outlining, e.g. noticing that two outlining
> >          candidates are similar and can be merged into one function
> >          with some control flow.
> >
> >
> > ===============================
> > Code
> > ===============================
> > Note: This code requires MachineModulePasses
> >
> > * Main pass:
> > https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
> >
> > * Suffix tree:
> > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
> >
> > * TerminatedString and TerminatedStringList:
> > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h
> >
> > Here are a couple unit tests for the data structures.
> >
> > * Suffix tree unit tests:
> > https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp
> >
> > * TerminatedString unit tests:
> > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp
> >
> > * TerminatedStringList unit tests:
> > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp
> >
> >
> > ===============================
> > References
> > ===============================
> > [1] Ukkonen's Algorithm:
> > https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
> > [2] Suffix Trees and Suffix Arrays:
> > http://web.cs.iastate.edu/~cs548/suffix.pdf
> > [3] Extended Application of Suffix Trees to Data Compression:
> > http://www.larsson.dogma.net/dccpaper.pdf
> >
> >
> > Thanks for reading,
> > Jessica
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory