Hal Finkel via llvm-dev
2016-Aug-27 04:44 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
----- Original Message -----
> From: "Daniel Berlin" <dberlin at dberlin.org>
> To: "Quentin Colombet" <qcolombet at apple.com>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, August 26, 2016 11:06:56 PM
> Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
>
> FWIW: I'm with Quentin. Without a good value numbering analysis, this is a hard problem.

How exactly does value numbering help here? This problem seems to be about finding structurally-similar parts of the computations of different values, not identical values. It seems much closer to the analysis necessary for SLP vectorization than value numbering, as such. I think we might be able to use value numbering to help with a subset of SLP vectorization, but only because we can define some kind of equivalence relation on consecutive memory accesses (and similar). What you need for this outlining seems more akin to the general SLP-vectorization case. This might just be the maximal common subgraph problem.

 -Hal

> GVN as it exists now doesn't really provide what you want, and in fact, doesn't value number a lot of instruction types (including, for example, loads, stores, calls, phis, etc.). It also relies on ordering, and, more importantly, it relies on *not* being a strong analysis. If you were to stop it from eliminating as it went, it would catch maybe 10% of the redundancies it does now.
> So what you want is "I want to know what globally is the same", and it doesn't really answer that well.
>
> Doing the analysis Quentin wants is pretty trivial in NewGVN (analysis and elimination are split, everything is broken into congruence classes explicitly, each congruence class has an id, you could sub in that id as the value for the terminated string), but I'd agree that GVN as it exists now will not do what they want, and would be pretty hard to make work well.
>
> (FWIW: Davide Italiano has started breaking up NewGVN into submittable pieces, and is pretty far along to having a first patch set, I believe.)
>
> On Fri, Aug 26, 2016 at 4:54 PM, Quentin Colombet via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> > Hi,
> >
> > I'll let Jessica give more details, but here are some insights.
> >
> > MIR offers a straightforward way to model benefits, because we know which instructions we remove and which ones we add, and there is no overhead of setting up parameters. Indeed, since the coloring will be the same between the different outlining candidates, the call is just a jump somewhere else. We do not have to worry about the impact of parameter passing and ABI.
> >
> > So basically, better cost model. That's one part of the story.
> >
> > The other part is that at the LLVM IR level, or before register allocation, identifying similar code sequences is much harder, at least with a suffix-tree-like algorithm. Basically, the problem is how we name our instructions such that we can match them.
> >
> > Let me take an example.
> >
> > foo() {
> >   /* bunch of code */
> >   a = add b, c;
> >   d = add e, f;
> > }
> >
> > bar() {
> >   d = add e, g;
> >   f = add c, w;
> > }
> >
> > With proper renaming, we can outline both adds in one function. The difficulty is to recognize that they are semantically equivalent to give them the same identifier in the suffix tree. I won’t get into the details, but it gets tricky quickly.
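Quentin's naming problem can be made concrete with a toy sketch. The tokenization scheme below (opcode plus operand *classes*, ignoring register/SSA names) is a hypothetical illustration by the editor, not code from the proposed pass; it shows how the two `add`s in `foo` and the two in `bar` can be mapped to the same symbol for the suffix tree:

```python
import re

def canonicalize(instr):
    """Map 'a = add b, c' to a name-insensitive token like 'add(reg,reg)'.

    Register/SSA names are collapsed to the class 'reg' (and literals to
    'imm'), so structurally identical instructions compare equal
    regardless of renaming.
    """
    m = re.match(r"(?:(\w+)\s*=\s*)?(\w+)\s*(.*)", instr.strip().rstrip(";"))
    _, opcode, operands = m.groups()
    classes = ["imm" if op.isdigit() else "reg"
               for op in filter(None, (o.strip() for o in operands.split(",")))]
    return f"{opcode}({','.join(classes)})"

foo = ["a = add b, c", "d = add e, f"]
bar = ["d = add e, g", "f = add c, w"]

# Both blocks collapse to the same token sequence, so a suffix tree over
# these tokens would find the common two-instruction sequence.
assert [canonicalize(i) for i in foo] == [canonicalize(i) for i in bar]
```

Note that this naive scheme over-approximates: it treats *any* two adds over registers as equal, whereas the thread's point is that real semantic equivalence (same abstract inputs up to renaming, consistently across the whole candidate) is what gets tricky.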
> > We were thinking of reusing GVN to have such an identifier if we wanted to do the outlining at the IR level, but solving this problem is hard.
> >
> > By running after regalloc, we basically have a heuristic that does this naming for us.
> >
> > Cheers,
> > -Quentin
> >
> > > On Aug 26, 2016, at 3:01 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > >
> > > > From: "Kevin Choi" <code.kchoi at gmail.com>
> > > > To: "Hal Finkel" <hfinkel at anl.gov>
> > > > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> > > > Sent: Friday, August 26, 2016 4:55:29 PM
> > > > Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
> > > >
> > > > I think the "Motivation" section explained that.
> > >
> > > I don't think it explained it.
> > >
> > > > I too first thought about "why not at IR?", but the reason looks like MIR, post-RA, has the most accurate heuristics (the best way to know looks like actually getting there).
> > >
> > > But also, potentially, the fewest opportunities. That's why I'm curious about the motivation - the trade-offs are not obvious to me.
> > >
> > > -Hal
> > >
> > > > Do you know if there is any experimental pass that relies on deriving heuristics by a feedback loop after letting, i.e., a duplicate module/function/block continue past?
> > > >
> > > > Regards,
> > > > Kevin
> > > >
> > > > On 26 August 2016 at 14:36, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > > >
> > > > > Hi Jessica,
> > > > >
> > > > > This is quite interesting.
> > > > >
> > > > > Can you comment on why you started by doing this at the MI level, as opposed to the IR level? And at the MI level, why after RA instead of before RA?
> > > > >
> > > > > Perhaps the first question is another way of asking about how accurately we can model call-site code size at the IR level?
> > > > >
> > > > > Thanks in advance,
> > > > > Hal
> > > > >
> > > > > > From: "Jessica Paquette via llvm-dev" <llvm-dev at lists.llvm.org>
> > > > > > To: llvm-dev at lists.llvm.org
> > > > > > Sent: Friday, August 26, 2016 4:26:09 PM
> > > > > > Subject: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Since I haven't said anything on the mailing list before, a quick introduction. I'm an intern at Apple, and over the summer I implemented a prototype for an outlining pass for code size in LLVM. Now I'm seeking to eventually upstream it. I have the preliminary code on GitHub right now, but it's still very prototypical (see the code section).
> > > > > >
> > > > > > ===============================
> > > > > > Motivation
> > > > > > ===============================
> > > > > > The goal of the internship was to create an interprocedural pass that would reduce code size as much as possible, perhaps at the cost of some performance.
> > > > > > This would be useful to, say, embedded programmers who only have a few kilobytes to work with and a substantial amount of code to fit in that space.
> > > > > >
> > > > > > ===============================
> > > > > > Approach and Initial Results
> > > > > > ===============================
> > > > > > To do this, we chose to implement an outliner. Outliners find sequences of instructions which would be better off as a function call, by some measure of "better". In this case, the measure of "better" is "makes code smaller".
> > > > > >
> > > > > > ===============================
> > > > > > Results
> > > > > > ===============================
> > > > > > These results are from a fairly recent 64-bit Intel processor, using a version of Clang equipped with the outliner prototype versus an equivalent version of Clang without the outliner.
> > > > > >
> > > > > > CODE SIZE REDUCTION
> > > > > > For tests >= 4 KB in non-outlined size, the outliner currently provides an average of 12.94% code size reduction on the LLVM test suite in comparison to a default Clang, up to 51.4% code size reduction. In comparison to a Clang with -Oz, the outliner provides an average of a 1.29% code size reduction, up to a 37% code size reduction. I believe that the -Oz numbers can be further improved by better tuning the outlining cost model.
> > > > > >
> > > > > > EXECUTION TIME IMPACT
> > > > > > On average, the outliner increases execution time by 2% on the LLVM test suite, but has also been shown to improve execution time by up to 16%. These results were from a fairly recent Intel processor, so the results may vary. Recent Intel processors have very low latency for function calls, which may impact these results.
> > > > > > Execution time improvements are likely dependent on the latency of function calls, instruction caching behaviour, and the execution frequency of the code being outlined. In particular, using a processor with heavy function call latency will likely increase execution time overhead.
> > > > > >
> > > > > > ===============================
> > > > > > Implementation
> > > > > > ===============================
> > > > > > The outliner, in its current state, is a MachineModulePass. It finds *identical* sequences of MIR, after register allocation, and pulls them out into their own functions. Thus, it's effectively assembly-level. Ultimately, the algorithm used is general, so it can sit anywhere, but MIR was very convenient for the time being.
> > > > > >
> > > > > > It requires two data structures.
> > > > > >
> > > > > > 1. A generalized suffix tree
> > > > > > 2. A "terminated string"
> > > > > >
> > > > > > 1: The generalized suffix tree is constructed using Ukkonen's linear-time construction algorithm [1].
> > > > > > Suffix trees require linear space and support linear-time substring queries. In practice, constructing the suffix tree is the most time-consuming part, but I haven't noticed a difference in compilation time on, say, 12 MB .ll files.
> > > > > >
> > > > > > 2: To support the suffix tree, we need a "terminated string." This is a generalized string with a unique terminator character appended to the end. TerminatedStrings can be built from any type.
> > > > > >
> > > > > > The algorithm is then roughly as follows.
> > > > > >
> > > > > > 1. For each MachineBasicBlock in the program, build a TerminatedString for that block.
> > > > > > 2. Build a suffix tree for that collection of strings.
> > > > > > 3. Query the suffix tree for the longest repeated substring and place that string in a candidate list. Repeat until none are found.
> > > > > > 4. Create functions for each candidate.
> > > > > > 5. Replace each candidate with a call to its respective function.
> > > > > >
> > > > > > Currently, the program itself isn't stored in the suffix tree, but rather a "proxy string" of integers.
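[Editor's sketch: the five steps above can be illustrated with a deliberately naive stand-in. The quadratic `longest_repeated_substring` below replaces the real linear-time suffix-tree query, and blocks are plain lists of opcode tokens rather than MIR; this is not the proposed pass, just the shape of its loop.]

```python
def longest_repeated_substring(blocks, min_len=2):
    """Naive stand-in for the suffix-tree query (step 3): find the longest
    token sequence of at least min_len that occurs twice or more."""
    seqs = [tuple(b) for b in blocks]
    best = ()
    for s in seqs:
        for i in range(len(s)):
            for j in range(i + min_len, len(s) + 1):
                cand = s[i:j]
                count = sum(1 for t in seqs
                            for k in range(len(t) - len(cand) + 1)
                            if t[k:k + len(cand)] == cand)
                if count >= 2 and len(cand) > len(best):
                    best = cand
    return list(best)

def outline(blocks):
    """Steps 3-5: repeatedly pull out the longest repeated sequence,
    create a 'function' for it, and replace occurrences with a call."""
    functions = {}
    while True:
        cand = longest_repeated_substring(blocks)
        if not cand:
            return blocks, functions
        name = f"OUTLINED_{len(functions)}"
        functions[name] = cand
        n = len(cand)
        new_blocks = []
        for b in blocks:
            out, i = [], 0
            while i < len(b):
                if b[i:i + n] == cand:
                    out.append(f"call {name}")
                    i += n
                else:
                    out.append(b[i])
                    i += 1
            new_blocks.append(out)
        blocks = new_blocks

blocks, funcs = outline([["mov", "add", "mul", "ret"],
                         ["sub", "add", "mul", "ret"]])
assert funcs == {"OUTLINED_0": ["add", "mul", "ret"]}
assert blocks == [["mov", "call OUTLINED_0"], ["sub", "call OUTLINED_0"]]
```

Each iteration replaces at least two occurrences of a multi-token sequence with single call tokens, so the total length strictly shrinks and the loop terminates; the real pass additionally needs a cost model to decide when a candidate actually saves bytes.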
> > > > > > This isn't necessary at the MIR level, but may be for an IR-level extension of the algorithm.
> > > > > >
> > > > > > ===============================
> > > > > > Challenges
> > > > > > ===============================
> > > > > > 1) MEMORY CONSUMPTION
> > > > > > Given a string of length n, a naive suffix tree implementation can take up to 40n bytes of memory. However, this number can be reduced to 20n with a bit of work [2]. Currently, the suffix tree stores the entire program, including instructions which really ought not to be outlined, such as returns. These instructions should not be included in the final implementation, but should rather act as terminators for the strings. This will likely curb memory consumption. Suffix trees have been used in the past in sliding-window-based compression schemes, which may serve as a source of inspiration for reducing memory overhead. [3]
> > > > > >
> > > > > > Nonetheless, the outliner probably shouldn't be run unless it really has to be run.
> > > > > > It will likely be used mostly in embedded spaces, where the programs have to fit into small devices anyway. Thus, memory overhead for the compiler likely won't be a problem. The outliner should only be used in -Oz compilations, and possibly should have its own flag.
> > > > > >
> > > > > > 2) EXECUTION TIME
> > > > > > Currently, the outliner isn't tuned for preventing execution time increases. There is an average of a 2% execution time hit on the tests in the LLVM test suite, with a few outliers of up to 30%. The outliers are tests which contain hot loops. The outliner really ought to be able to use profiling information and not outline from hot areas. Another suggestion people have given me is to add a "never outline" directive which would allow the user to say something along the lines of "this is a hot loop, please never outline from it".
> > > > > >
> > > > > > It's also important to note that these numbers are coming from a fairly recent Intel processor.
> > > > > >
> > > > > > 3) CONSTRAINTS ON INSTRUCTIONS
> > > > > > The outliner currently won't pull anything out of functions which use a red zone. It also won't pull anything out that uses the stack, the instruction pointer, constant pool indices, CFI indices, jump table indices, or frame indices. This removes many opportunities for outlining which would likely be available at a higher level (such as IR). Thus, there's a case for moving this up to a higher level.
> > > > > >
> > > > > > ===============================
> > > > > > Additional Applications
> > > > > > ===============================
> > > > > > The suffix tree itself could be used as a tool for finding opportunities to refactor code. For example, it could recognize places where the user likely copied and pasted some code.
> > > > > > This could be run on codebases to find areas where people could manually outline things at the source level.
> > > > > >
> > > > > > Using the terminated string class, it would also be possible to implement other string algorithms on code. This may open the door to new ways to analyze existing codebases.
> > > > > >
> > > > > > ===============================
> > > > > > Roadmap
> > > > > > ===============================
> > > > > > The current outliner is *very* prototypical. The version I would want to upstream would be a new implementation. Here's what I'd like to address and accomplish.
> > > > > >
> > > > > > 1. Ask "what does the LLVM community want from an outliner?" and use that to drive development of the algorithm.
> > > > > > 2. Reimplement the outliner, perhaps using a less memory-intensive data structure like a suffix array.
> > > > > > 3. Begin adding features to the algorithm, for example:
> > > > > >    i. Teaching the algorithm about hot/cold blocks of code and taking that into account.
> > > > > >    ii. Simple parameter passing.
> > > > > >    iii. Similar function outlining -- e.g., noticing that two outlining candidates are similar and can be merged into one function with some control flow.
> > > > > >
> > > > > > ===============================
> > > > > > Code
> > > > > > ===============================
> > > > > > Note: This code requires MachineModulePasses.
> > > > > >
> > > > > > * Main pass:
> > > > > > https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
> > > > > >
> > > > > > * Suffix tree:
> > > > > > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
> > > > > >
> > > > > > * TerminatedString and TerminatedStringList:
> > > > > > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h
> > > > > >
> > > > > > Here are a couple of unit tests for the data structures.
> > > > > >
> > > > > > * Suffix tree unit tests:
> > > > > > https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp
> > > > > >
> > > > > > * TerminatedString unit tests:
> > > > > > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp
> > > > > >
> > > > > > * TerminatedStringList unit tests:
> > > > > > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp
> > > > > >
> > > > > > ===============================
> > > > > > References
> > > > > > ===============================
> > > > > > [1] Ukkonen's Algorithm: https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
> > > > > > [2] Suffix Trees and Suffix Arrays: http://web.cs.iastate.edu/~cs548/suffix.pdf
> > > > > > [3] Extended Application of Suffix Trees to Data Compression: http://www.larsson.dogma.net/dccpaper.pdf
> > > > > >
> > > > > > Thanks for reading,
> > > > > > Jessica
> > > > > >
> > > > > > _______________________________________________
> > > > > > LLVM Developers mailing list
> > > > > > llvm-dev at lists.llvm.org
> > > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > > > >
> > > > > --
> > > > > Hal Finkel
> > > > > Assistant Computational Scientist
> > > > > Leadership Computing Facility
> > > > > Argonne National Laboratory
> > > > >
> > > > > _______________________________________________
> > > > > LLVM Developers mailing list
> > > > > llvm-dev at lists.llvm.org
> > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > llvm-dev at lists.llvm.org
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160826/e5551446/attachment-0001.html>
Daniel Berlin via llvm-dev
2016-Aug-27 06:28 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
On Fri, Aug 26, 2016 at 9:44 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> From: "Daniel Berlin" <dberlin at dberlin.org>
> To: "Quentin Colombet" <qcolombet at apple.com>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, August 26, 2016 11:06:56 PM
> Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
>
> FWIW: I'm with Quentin. Without a good value numbering analysis, this is a hard problem.
>
> How exactly does value numbering help here? This problem seems to be about finding structurally-similar parts of the computations of different values, not identical values.

Note, first, that the optimization we are talking about is not outlining, at least as far as I've ever heard the term used. Function outlining, etc., is about moving cold regions of functions out to separate functions to make the hot portions smaller, more inlinable, etc.

What is being talked about here is a variant of identical code folding, e.g., http://research.google.com/pubs/pub36912.html and its variants, where "identical" varies (and is based on semantic equivalence). Variants where you partially extract portions to merge functions are pretty much the same optimization :)

Here, value numbering helps because it gives you value expressions. It tells you: this operation is "add V1, V2". It does not matter what V1 and V2 are; they just "are abstract things". We know whether these abstract things are equivalent or not.

More relevantly, these expressions form a value number DAG. That is, they represent the computations being computed only in terms of operators, abstract variables, and other parts of the VNDAG. The *actual values* of those computations, numbers, etc., do not matter (i.e., each abstract value number has some set of things *in the program* that are equivalent to it).
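[Editor's sketch of the idea above: a minimal hash-based value numberer, not NewGVN. Each leaf gets an abstract number, and each operation is numbered by its (opcode, operand-numbers) expression, so equivalent computations land in the same congruence class regardless of variable names or position.]

```python
class ValueNumberer:
    """Minimal hash-based value numbering over (opcode, operands) expressions.

    Commutative ops sort their operand numbers, so 'add b, c' and
    'add c, b' fall into the same congruence class.
    """
    COMMUTATIVE = {"add", "mul"}

    def __init__(self):
        self.table = {}   # expression -> congruence class id
        self.next_id = 0

    def leaf(self, name):
        # Abstract input: "some thing", identified only by its class id.
        return self._number(("leaf", name))

    def op(self, opcode, *operand_ids):
        if opcode in self.COMMUTATIVE:
            operand_ids = tuple(sorted(operand_ids))
        return self._number((opcode,) + tuple(operand_ids))

    def _number(self, expr):
        if expr not in self.table:
            self.table[expr] = self.next_id
            self.next_id += 1
        return self.table[expr]

vn = ValueNumberer()
b, c = vn.leaf("b"), vn.leaf("c")
# 'a = add b, c' and 'x = add c, b' get the same value number...
assert vn.op("add", b, c) == vn.op("add", c, b)
# ...and ids like these are what could stand in as the symbols of the
# "terminated string" fed to the suffix tree, as suggested in the thread.
```

Note the contrast with a naive opcode-only tokenization: here `add b, c` and `add e, g` get *different* numbers unless their operands are themselves in the same classes, which is exactly the semantic-equivalence property the thread is after.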
It is "are these semantically the same operations in the same order", regardless of the values of those operations. See, for example, the DAGs generated by: https://www.researchgate.net/publication/221323142_An_Efficient_SSA-Based_Algorithm_for_Complete_Global_Value_Numbering

Again, this is not something the current GVN does, but NewGVN (and other GVNs) can give it to you. The question of which portions of a function you can merge is a variant of the common subgraph and graph isomorphism problems on a VNDAG (because, in the end, it doesn't matter where the computations are; it matters what they abstractly compute, and what they depend on).

> It seems much closer to the analysis necessary for SLP vectorization than value numbering, as such.
>
> I think we might be able to use value numbering to help with a subset of SLP vectorization, but only because we can define some kind of equivalence relation on consecutive memory accesses (and similar). What you need for this outlining seems more akin to the general SLP-vectorization case. This might just be the maximal common subgraph problem.

If I have an add in a loop, and it's VN-equivalent to an add outside a loop in some other function, I can still merge the functions :) It's how the value is generated and used that matters, not the structure of the program. So you can't just do it on a regular CFG or anything like that if you want good results.

> -Hal
>
> GVN as it exists now doesn't really provide what you want, and in fact, doesn't value number a lot of instruction types (including, for example, loads, stores, calls, phis, etc.). It also relies on ordering, and, more importantly, it relies on *not* being a strong analysis. If you were to stop it from eliminating as it went, it would catch maybe 10% of the redundancies it does now. So what you want is "I want to know what globally is the same", and it doesn't really answer that well.
>
> Doing the analysis Quentin wants is pretty trivial in NewGVN (analysis and elimination are split, everything is broken into congruence classes explicitly, each congruence class has an id, you could sub in that id as the value for the terminated string), but I'd agree that GVN as it exists now will not do what they want, and would be pretty hard to make work well.
>
> (FWIW: Davide Italiano has started breaking up NewGVN into submittable pieces, and is pretty far along to having a first patch set, I believe.)
>
> On Fri, Aug 26, 2016 at 4:54 PM, Quentin Colombet via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> Hi,
>>
>> I'll let Jessica give more details, but here are some insights.
>>
>> MIR offers a straightforward way to model benefits, because we know which instructions we remove and which ones we add, and there is no overhead of setting up parameters. Indeed, since the coloring will be the same between the different outlining candidates, the call is just a jump somewhere else. We do not have to worry about the impact of parameter passing and ABI.
>>
>> So basically, better cost model. That's one part of the story.
>>
>> The other part is that at the LLVM IR level, or before register allocation, identifying similar code sequences is much harder, at least with a suffix-tree-like algorithm. Basically, the problem is how we name our instructions such that we can match them.
>>
>> Let me take an example.
>>
>> foo() {
>>   /* bunch of code */
>>   a = add b, c;
>>   d = add e, f;
>> }
>>
>> bar() {
>>   d = add e, g;
>>   f = add c, w;
>> }
>>
>> With proper renaming, we can outline both adds in one function. The difficulty is to recognize that they are semantically equivalent to give them the same identifier in the suffix tree. I won’t get into the details, but it gets tricky quickly. We were thinking of reusing GVN to have such an identifier if we wanted to do the outlining at the IR level, but solving this problem is hard.
>> >> By running after regalloc, we basically have a heuristic that does this >> naming for us. >> >> Cheers, >> -Quentin >> >> >> On Aug 26, 2016, at 3:01 PM, Hal Finkel via llvm-dev < >> llvm-dev at lists.llvm.org> wrote: >> >> >> ------------------------------ >> >> *From: *"Kevin Choi" <code.kchoi at gmail.com> >> *To: *"Hal Finkel" <hfinkel at anl.gov> >> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org> >> *Sent: *Friday, August 26, 2016 4:55:29 PM >> *Subject: *Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass >> >> I think the "Motivation" section explained that. >> >> I don't think it explained it. >> >> I too first thought about "why not at IR?" but the reason looks like MIR, >> post-RA has the most accurate heuristics (best way to know looks like >> actually getting there). >> >> But also, potentially, the fewest opportunities. That's why I'm curious >> about the motivation - the trade offs are not obvious to me. >> >> -Hal >> >> >> >> Do you know if there is any experimental pass that relies on deriving >> heuristics by a feedback loop after letting, ie. a duplicate >> module/function/block continue past? >> >> Regards, >> Kevin >> >> On 26 August 2016 at 14:36, Hal Finkel via llvm-dev <llvm-dev at lists.llvm. >> org> wrote: >> >>> Hi Jessica, >>> >>> This is quite interesting. >>> >>> Can you comment on why you started by doing this at the MI level, as >>> opposed to the IR level. And at the MI level, why after RA instead of >>> before RA? >>> >>> Perhaps the first question is another way of asking about how accurately >>> we can model call-site code size at the IR level? 
>>> >>> Thanks in advance, >>> Hal >>> >>> ------------------------------ >>> >>> > From: "Jessica Paquette via llvm-dev" <llvm-dev at lists.llvm.org> >>> > To: llvm-dev at lists.llvm.org >>> > Sent: Friday, August 26, 2016 4:26:09 PM >>> > Subject: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass >>> > >>> > Hi everyone, >>> > >>> > Since I haven't said anything on the mailing list before, a quick >>> > introduction. I'm an intern at Apple, and over the summer I >>> > implemented a >>> > prototype for an outlining pass for code size in LLVM. Now I'm >>> > seeking to >>> > eventually upstream it. I have the preliminary code on GitHub right >>> > now, >>> > but it's still very prototypical (see the code section). >>> > >>> > ===============================>>> > Motivation >>> > ===============================>>> > The goal of the internship was to create an interprocedural pass that >>> > would reduce code size as much as possible, perhaps at the cost of >>> > some >>> > performance. This would be useful to, say, embedded programmers who >>> > only >>> > have a few kilobytes to work with and a substantial amount of code to >>> > fit >>> > in that space. >>> > >>> > >>> > ===============================>>> > Approach and Initial Results >>> > ===============================>>> > To do this, we chose to implement an outliner. Outliners find >>> > sequences of >>> > instructions which would be better off as a function call, by some >>> > measure >>> > of "better". In this case, the measure of "better" is "makes code >>> > smaller". >>> > >>> > >>> > ===============================>>> > Results >>> > ===============================>>> > These results are from a fairly recent 64-bit Intel processor, using >>> > a >>> > version of Clang equipped with the outliner prototype versus an >>> > equivalent >>> > version of Clang without the outliner. 
>>> > >>> > CODE SIZE REDUCTION >>> > For tests >=4 Kb in non-outlined size, the outliner currently >>> > provides an >>> > average of 12.94% code size reduction on the LLVM test suite in >>> > comparison >>> > to a default Clang, up to 51.4% code size reduction. In comparison to >>> > a >>> > Clang with -Oz, the outliner provides an average of a 1.29% code size >>> > reduction, up to a 37% code size reduction. I believe that the -Oz >>> > numbers >>> > can be further improved by better tuning the outlining cost model. >>> > >>> > EXECUTION TIME IMPACT >>> > On average, the outliner increases execution time by 2% on the LLVM >>> > test >>> > suite, but has also been shown to improve execution time by up to 16%. >>> > These results were from a fairly recent Intel processor, so the >>> > results >>> > may vary. Recent Intel processors have very low latency for function >>> > calls, which may impact these results. Execution time improvements >>> > are >>> > likely dependent on the latency of function calls, instruction >>> > caching >>> > behaviour, and the execution frequency of the code being outlined. In >>> > particular, using a processor with heavy function call latency will >>> > likely >>> > increase execution time overhead. >>> > >>> > >>> > ===============================>>> > Implementation >>> > ===============================>>> > The outliner, in its current state, is a MachineModulePass. It finds >>> > *identical* sequences of MIR, after register allocation, and pulls >>> > them >>> > out into their own functions. Thus, it's effectively assembly-level. >>> > Ultimately, the algorithm used is general, so it can sit anywhere, >>> > but MIR >>> > was very convenient for the time being. >>> > >>> > It requires two data structures. >>> > >>> > 1. A generalized suffix tree >>> > 2. A "terminated string" >>> > >>> > 1: The generalized suffix tree is constructed using Ukkonen's linear >>> > time >>> > construction algorithm [1].
Suffix trees require linear space and support >>> > linear-time substring queries. In practice, the construction time for >>> > the >>> > suffix tree is the most time-consuming part, but I haven't noticed a >>> > difference in compilation time on, say, 12 MB .ll files. >>> > >>> > 2: To support the suffix tree, we need a "terminated string." This is >>> > a >>> > generalized string with a unique terminator character appended to >>> > the >>> > end. TerminatedStrings can be built from any type. >>> > >>> > The algorithm is then roughly as follows. >>> > >>> > 1. For each MachineBasicBlock in the program, build a >>> > TerminatedString for >>> > that block. >>> > 2. Build a suffix tree for that collection of strings. >>> > 3. Query the suffix tree for the longest repeated substring and place >>> > that >>> > string in a candidate list. Repeat until none are found. >>> > 4. Create functions for each candidate. >>> > 5. Replace each candidate with a call to its respective function. >>> > >>> > Currently, the program itself isn't stored in the suffix tree, but >>> > rather >>> > a "proxy string" of integers. This isn't necessary at the MIR level, >>> > but >>> > may be for an IR-level extension of the algorithm.
Suffix trees have been used in >>> > the >>> > past in sliding-window-based compression schemes, which may serve as >>> > a >>> > source of inspiration for reducing memory overhead.[3] >>> > >>> > Nonetheless, the outliner probably shouldn't be run unless it really >>> > has >>> > to be run. It will likely be used mostly in embedded spaces, where >>> > the >>> > programs have to fit into small devices anyway. Thus, memory overhead >>> > for >>> > the compiler likely won't be a problem. The outliner should only be >>> > used >>> > in -Oz compilations, and possibly should have its own flag. >>> > >>> > >>> > 2) EXECUTION TIME >>> > Currently, the outliner isn't tuned for preventing execution time >>> > increases. There is an average of a 2% execution time hit on the >>> > tests in >>> > the LLVM test suite, with a few outliers of up to 30%. The outliers >>> > are >>> > tests which contain hot loops. The outliner really ought to be able >>> > to use >>> > profiling information and not outline from hot areas. Another >>> > suggestion >>> > people have given me is to add a "never outline" directive which >>> > would >>> > allow the user to say something along the lines of "this is a hot >>> > loop, >>> > please never outline from it". >>> > >>> > It's also important to note that these numbers are coming from a >>> > fairly >>> > recent Intel processor. >>> > >>> > >>> > 3) CONSTRAINTS ON INSTRUCTIONS >>> > The outliner currently won't pull anything out of functions which use >>> > a >>> > red zone. It also won't pull anything out that uses the stack, >>> > instruction >>> > pointer, uses constant pool indices, CFI indices, jump table indices, >>> > or >>> > frame indices. This removes many opportunities for outlining which >>> > would >>> > likely be available at a higher level (such as IR). Thus, there's a >>> > case >>> > for moving this up to a higher level. 
>>> > >>> > >>> > ===============================>>> > Additional Applications >>> > ===============================>>> > The suffix tree itself could be used as a tool for finding >>> > opportunities >>> > to refactor code. For example, it could recognize places where the >>> > user >>> > likely copied and pasted some code. This could be run on codebases to >>> > find >>> > areas where people could manually outline things at the source level. >>> > >>> > Using the terminated string class, it would also be possible to >>> > implement >>> > other string algorithms on code. This may open the door to new ways >>> > to >>> > analyze existing codebases. >>> > >>> > >>> > ===============================>>> > Roadmap >>> > ===============================>>> > The current outliner is *very* prototypical. The version I would want >>> > to >>> > upstream would be a new implementation. Here's what I'd like to >>> > address >>> > and accomplish. >>> > >>> > 1. Ask "what does the LLVM community want from an outliner" and use >>> > that >>> > to drive development of the algorithm. >>> > 2. Reimplement the outliner, perhaps using a less memory-intensive >>> > data >>> > structure like a suffix array. >>> > 3. Begin adding features to the algorithm, for example: >>> > i. Teaching the algorithm about hot/cold blocks of code and >>> > taking >>> > that into account. >>> > ii. Simple parameter passing. >>> > iii. Similar function outlining -- e.g., noticing that two outlining >>> > candidates are similar and can be merged into one function with some >>> > control flow.
>>> > >>> > >>> > ===============================>>> > Code >>> > ===============================>>> > Note: This code requires MachineModulePasses >>> > >>> > * Main pass: >>> > https://github.com/ornata/llvm/blob/master/lib/CodeGen/ >>> MachineOutliner.h >>> > >>> > * Suffix tree: >>> > https://github.com/ornata/llvm/blob/master/include/llvm/ >>> ADT/SuffixTree.h >>> > >>> > * TerminatedString and TerminatedStringList: >>> > https://github.com/ornata/llvm/blob/master/include/llvm/ >>> ADT/TerminatedString.h >>> > >>> > Here are a couple unit tests for the data structures. >>> > >>> > * Suffix tree unit tests: >>> > https://github.com/ornata/llvm/blob/master/unittests/ >>> ADT/SuffixTreeTest.cpp >>> > >>> > * TerminatedString unit tests: >>> > https://github.com/ornata/llvm/blob/master/unittests/ >>> ADT/TerminatedStringTest.cpp >>> > >>> > * TerminatedStringList unit tests: >>> > https://github.com/ornata/llvm/blob/master/unittests/ >>> ADT/TerminatedStringListTest.cpp >>> > >>> > >>> > ===============================>>> > References >>> > ===============================>>> > [1] Ukkonen's Algorithm: >>> > https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf >>> > [2] Suffix Trees and Suffix Arrays: >>> > http://web.cs.iastate.edu/~cs548/suffix.pdf >>> > [3] Extended Application of Suffix Trees to Data Compression: >>> > http://www.larsson.dogma.net/dccpaper.pdf >>> > >>> > >>> > Thanks for reading, >>> > Jessica >>> > >>> > _______________________________________________ >>> > LLVM Developers mailing list >>> > llvm-dev at lists.llvm.org >>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> > >>> >>> -- >>> Hal Finkel >>> Assistant Computational Scientist >>> Leadership Computing Facility >>> Argonne National Laboratory >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> >> >> >> >> >> -- >> Hal Finkel >> Assistant 
Computational Scientist >> Leadership Computing Facility >> Argonne National Laboratory >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> >> >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160826/45bdc0b7/attachment.html>
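To make the naming problem from Quentin's foo()/bar() example concrete, here is a minimal, hypothetical sketch (not NewGVN and not the actual pass; `number_block` and its encoding are invented for illustration) of mapping instructions to integer ids while abstracting operands by first-use order, so that `a = add b, c` and `d = add e, f` collide on purpose. Those ids are the kind of "proxy string" symbols a suffix tree could consume. Note this is closer to the post-regalloc renaming heuristic Quentin describes than to true congruence-based value numbering, which would also require the operands themselves to be in the same congruence class.

```python
# Hypothetical sketch: turn instructions into integer ids such that
# instructions equivalent up to operand renaming get the same id.
# This mirrors the RFC's "proxy string" idea, not NewGVN itself.

def number_block(instrs, ids):
    """Map each (opcode, operands) pair to a class id.

    Operands are abstracted to their first-appearance order within the
    instruction, so ('add', ('b', 'c')) and ('add', ('e', 'f')) hash
    identically on purpose. `ids` is shared across blocks/functions so
    matching instructions everywhere receive the same symbol."""
    out = []
    for opcode, operands in instrs:
        first_use = {}
        for op in operands:
            first_use.setdefault(op, len(first_use))
        key = (opcode, tuple(first_use[op] for op in operands))
        out.append(ids.setdefault(key, len(ids)))
    return out

ids = {}
foo = number_block([("add", ("b", "c")), ("add", ("e", "f"))], ids)
bar = number_block([("add", ("e", "g")), ("add", ("c", "w"))], ids)
# foo == bar == [0, 0]: all four adds share one id and can feed the
# same suffix-tree alphabet.
```

Sharing the `ids` table across functions is what lets the suffix tree later find the repeated `[0, 0]` sequence in both foo and bar as a single outlining candidate.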
Philip Reames via llvm-dev
2016-Aug-27 20:43 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
On 08/26/2016 11:28 PM, Daniel Berlin via llvm-dev wrote:> > > On Fri, Aug 26, 2016 at 9:44 PM, Hal Finkel <hfinkel at anl.gov > <mailto:hfinkel at anl.gov>> wrote: > > > ------------------------------------------------------------------------ > > *From: *"Daniel Berlin" <dberlin at dberlin.org > <mailto:dberlin at dberlin.org>> > *To: *"Quentin Colombet" <qcolombet at apple.com > <mailto:qcolombet at apple.com>> > *Cc: *"Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at anl.gov>>, > "llvm-dev" <llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org>> > *Sent: *Friday, August 26, 2016 11:06:56 PM > *Subject: *Re: [llvm-dev] [RFC] Interprocedural MIR-level > outlining pass > > FWIW: I'm with quentin. Without a good value numbering > analysis, this is a hard problem. > > How exactly does value numbering help here? This problem seems to > be about finding structurally-similar parts of the computations of > different values, not identical values. > > Note, first, the optimization we are talking about is not outlining, > at least, as far as i've ever heard it.

Minor terminology point... The *mechanism* used here is definitely function outlining. The policy applied to select outline candidates is a variant of code folding techniques. Many times when we refer to "an optimization" we tend to mix policy and mechanism into the same thing. I'm highlighting the difference only because I found Danny's response slightly confusing on first read and thought others might as well. Back to your regularly scheduled technical discussion...> > function outlining/etc is about removing cold regions of functions to > separate functions to make the hot portions smaller, more inlinable, etc. > > What is being talked about here is a variant of identical code > folding, e.g., http://research.google.com/pubs/pub36912.html and its > variants. > > Where identical varies (and is based on semantic equivalence).
> Variants where you extract portions partially to merge functions are > pretty much the same optimization :) > > Here, value numbering helps because it gives you value expressions. > > It tells you: > > This operation is "add V1, V2". > It does not matter what V1 and V2 are, they just "are abstract > things". We know if these abstract things are equivalent or not. > > More relevantly, these expressions form a value number dag. > > That is, they represent the computations being computed only in terms > of operators, abstract variables, and other parts of the VNDAG > > The *actual values* of those computations, numbers, etc, do not matter > (IE the abstract value numbers have some set of things *in the > program* that are in the set of things equivalent to that value number). > > It is "are these semantically the same operations in the same order", > regardless of the value of those operations. > See, for example, the DAGS generated by: > > https://www.researchgate.net/publication/221323142_An_Efficient_SSA-Based_Algorithm_for_Complete_Global_Value_Numbering > > Again, this is not something current GVN does, but NewGVN (and other > GVN's do) can give you. > > The question of what portions of a function you can merge are variants > of common subgraph and graph isomorphism problems on a VNDAG. > > (because in the end, it doesn't matter where the computations are, it > matters what they abstractly compute, and what they depend on) > > > It seems much closer to the analysis necessary for SLP > vectorization than value numbering, as such. > > > I think we might be able to use value numbering to help with > subset of SLP vectorization, but only because we can define some > kind of equivalence relation on consecutive memory accesses (and > similar). What you need for this outlining seems more akin to the > general SLP-vectorization case. This might just be the maximal > common subgraph problem. 
> > > If i have an add in a loop, and it's VN equivalent to an add outside a > loop in some other function, i can still merge the functions :) > > It's how the value is generated and used that matters, not the > structure of the program. > So you can't just do it on a regular CFG or anything like that if you > want good results. > > > -Hal > > > GVN as it exists now doesn't really provide what you want, and > in fact, doesn't value number a lot of instruction types > (including, for example, loads, stores, calls, phis, etc). It > also relies on ordering, and more importantly, it relies on > *not* being a strong analysis. If you were to stop it from > eliminating as it went, it would catch maybe 10% of the > redundancies it does now. So what you want is "i want to know > what globally is the same", it doesn't really answer that well. > > Doing the analysis Quentin wants is pretty trivial in NewGVN > (analysis and elimination are split, everything is broken into > congruence classes explicitly, each congruence class has an > id, you could sub in that id as the value for the terminated > string), but i'd agree that GVN as it exists now will not do > what they want, and would be pretty hard to make work well. > > (FWIW: Davide Italiano has started breaking up newgvn into > submittable pieces, and is pretty far along to having a first > patch set I believe) > > > On Fri, Aug 26, 2016 at 4:54 PM, Quentin Colombet via llvm-dev > <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > Hi, > > I let Jessica give more details but here are some insights. > > MIR offers a straight forward way to model benefits, > because we know which instructions we remove and which one > we add and there are no overhead of setting up parameters. > Indeed, since the coloring will be the same between the > different outlining candidates, the call is just a jump > somewhere else. We do not have to worry about the impact > of parameter passing and ABI. 
> > So basically, better cost model. That's one part of the story. > > The other part is at the LLVM IR level or before register > allocation identifying similar code sequence is much > harder, at least with a suffix tree like algorithm. > Basically the problem is how do we name our instructions > such that we can match them. > Let me take an example. > foo() { > /* bunch of code */ > a = add b, c; > d = add e, f; > } > > bar() { > d = add e, g; > f = add c, w; > } > > With proper renaming, we can outline both adds in one > function. The difficulty is to recognize that they are > semantically equivalent to give them the same identifier > in the suffix tree. I won’t get into the details but it > gets tricky quickly. We were thinking of reusing GVN to > have such identifier if we wanted to do the outlining at > IR level but solving this problem is hard. > > By running after regalloc, we basically have a heuristic > that does this naming for us. > > Cheers, > -Quentin > > > On Aug 26, 2016, at 3:01 PM, Hal Finkel via llvm-dev > <llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org>> wrote: > > > ------------------------------------------------------------------------ > > *From:*"Kevin Choi" <code.kchoi at gmail.com > <mailto:code.kchoi at gmail.com>> > *To:*"Hal Finkel" <hfinkel at anl.gov > <mailto:hfinkel at anl.gov>> > *Cc:*"llvm-dev" <llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org>> > *Sent:*Friday, August 26, 2016 4:55:29 PM > *Subject:*Re: [llvm-dev] [RFC] Interprocedural > MIR-level outlining pass > > I think the "Motivation" section explained that. > > I don't think it explained it. > > I too first thought about "why not at IR?" but the > reason looks like MIR, post-RA has the most > accurate heuristics (best way to know looks like > actually getting there). > > But also, potentially, the fewest opportunities. > That's why I'm curious about the motivation - the > trade offs are not obvious to me. 
> > -Hal > > > > Do you know if there is any experimental pass that > relies on deriving heuristics by a feedback loop > after letting, ie. a duplicate > module/function/block continue past? > > Regards, > Kevin > > On 26 August 2016 at 14:36, Hal Finkel via > llvm-dev<llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org>>wrote: > > Hi Jessica, > > This is quite interesting. > > Can you comment on why you started by doing > this at the MI level, as opposed to the IR > level. And at the MI level, why after RA > instead of before RA? > > Perhaps the first question is another way of > asking about how accurately we can model > call-site code size at the IR level? > > Thanks in advance, > Hal > > ------------------------------------------------------------------------ > > > From: "Jessica Paquette via llvm-dev" > <llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org>> > > To:llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org> > > Sent: Friday, August 26, 2016 4:26:09 PM > > Subject: [llvm-dev] [RFC] Interprocedural > MIR-level outlining pass > > > > Hi everyone, > > > > Since I haven't said anything on the mailing > list before, a quick > > introduction. I'm an intern at Apple, and > over the summer I > > implemented a > > prototype for an outlining pass for code > size in LLVM. Now I'm > > seeking to > > eventually upstream it. I have the > preliminary code on GitHub right > > now, > > but it's still very prototypical (see the > code section). > > > > ===============================> > Motivation > > ===============================> > The goal of the internship was to create an > interprocedural pass that > > would reduce code size as much as possible, > perhaps at the cost of > > some > > performance. This would be useful to, say, > embedded programmers who > > only > > have a few kilobytes to work with and a > substantial amount of code to > > fit > > in that space. 
> > > > > > ===============================> > Approach and Initial Results > > ===============================> > To do this, we chose to implement an > outliner. Outliners find > > sequences of > > instructions which would be better off as a > function call, by some > > measure > > of "better". In this case, the measure of > "better" is "makes code > > smaller". > > > > > > ===============================> > Results > > ===============================> > These results are from a fairly recent > 64-bit Intel processor, using > > a > > version of Clang equipped with the outliner > prototype versus an > > equivalent > > version of Clang without the outliner. > > > > CODE SIZE REDUCTION > > For tests >=4 Kb in non-outlined size, the > outliner currently > > provides an > > average of 12.94% code size reduction on the > LLVM test suite in > > comparison > > to a default Clang, up to 51.4% code size > reduction. In comparison to > > a > > Clang with -Oz, the outliner provides an > average of a 1.29% code size > > reduction, up to a 37% code size reduction. > I believe that the -Oz > > numbers > > can be further improved by better tuning the > outlining cost model. > > > > EXECUTION TIME IMPACT > > On average, the outliner increases execution > time by 2% on the LLVM > > test > > suite, but has been also shown to improve > exection time by up to 16%. > > These results were from a fairly recent > Intel processor, so the > > results > > may vary. Recent Intel processors have very > low latency for function > > calls, which may impact these results. > Execution time improvements > > are > > likely dependent on the latency of function > calls, instruction > > caching > > behaviour, and the execution frequency of > the code being outlined. In > > partucular, using a processor with heavy > function call latency will > > likely > > increase execution time overhead. 
> > > > > > ===============================> > Implementation > > ===============================> > The outliner, in its current state, is a > MachineModulePass. It finds > > *identical* sequences of MIR, after register > allocation, and pulls > > them > > out into their own functions. Thus, it's > effectively assembly-level. > > Ultimately, the algorithm used is general, > so it can sit anywhere, > > but MIR > > was very convenient for the time being. > > > > It requires two data structures. > > > > 1. A generalized suffix tree > > 2. A "terminated string" > > > > 1: The generalized suffix tree is > constructed using Ukkonen's linear > > time > > construction algorithm [1]. They require > linear space and support > > linear-time substring queries. In practice, > the construction time for > > the > > suffix tree is the most time consuming part, > but I haven't noticed a > > difference in compilation time on, say, 12 > MB .ll files. > > > > 2: To support the suffix tree, we need a > "terminated string." This is > > a > > generalized string with an unique terminator > character appended to > > the > > end. TerminatedStrings can be built from any > type. > > > > The algorithm is then roughly as follows. > > > > 1. For each MachineBasicBlock in the > program, build a > > TerminatedString for > > that block. > > 2. Build a suffix tree for that collection > of strings. > > 3. Query the suffix tree for the longest > repeated substring and place > > that > > string in a candidate list. Repeat until > none are found. > > 4. Create functions for each candidate. > > 5. Replace each candidate with a call to its > respective function. > > > > Currently, the program itself isn't stored > in the suffix tree, but > > rather > > a "proxy string" of integers. This isn't > necessary at the MIR level, > > but > > may be for an IR level extension of the > algorithm. 
> > > > > > ===============================> > Challenges > > ===============================> > 1) MEMORY CONSUMPTION > > Given a string of length n, a naive suffix > tree implementation can > > take up > > to 40n bytes of memory. However, this number > can be reduced to 20n > > with a > > bit of work [2]. Currently, the suffix tree > stores the entire > > program, > > including instructions which really ought > not to be outlined, such as > > returns. These instructions should not be > included in the final > > implementation, but should rather act as > terminators for the strings. > > This > > will likely curb memory consumption. Suffix > trees have been used in > > the > > past in sliding-window-based compression > schemes, which may serve as > > a > > source of inspiration for reducing memory > overhead.[3] > > > > Nonetheless, the outliner probably shouldn't > be run unless it really > > has > > to be run. It will likely be used mostly in > embedded spaces, where > > the > > programs have to fit into small devices > anyway. Thus, memory overhead > > for > > the compiler likely won't be a problem. The > outliner should only be > > used > > in -Oz compilations, and possibly should > have its own flag. > > > > > > 2) EXECUTION TIME > > Currently, the outliner isn't tuned for > preventing execution time > > increases. There is an average of a 2% > execution time hit on the > > tests in > > the LLVM test suite, with a few outliers of > up to 30%. The outliers > > are > > tests which contain hot loops. The outliner > really ought to be able > > to use > > profiling information and not outline from > hot areas. Another > > suggestion > > people have given me is to add a "never > outline" directive which > > would > > allow the user to say something along the > lines of "this is a hot > > loop, > > please never outline from it". > > > > It's also important to note that these > numbers are coming from a > > fairly > > recent Intel processor. 
> > > > > > 3) CONSTRAINTS ON INSTRUCTIONS > > The outliner currently won't pull anything > out of functions which use > > a > > red zone. It also won't pull anything out > that uses the stack, > > instruction > > pointer, uses constant pool indices, CFI > indices, jump table indices, > > or > > frame indices. This removes many > opportunities for outlining which > > would > > likely be available at a higher level (such > as IR). Thus, there's a > > case > > for moving this up to a higher level. > > > > > > ===============================> > Additional Applications > > ===============================> > The suffix tree itself could be used as a > tool for finding > > opportunities > > to refactor code. For example, it could > recognize places where the > > user > > likely copied and pasted some code. This > could be run on codebases to > > find > > areas where people could manually outline > things at the source level. > > > > Using the terminated string class, it would > also be possible to > > implement > > other string algorithms on code. This may > open the door to new ways > > to > > analyze existing codebases. > > > > > > ===============================> > Roadmap > > ===============================> > The current outliner is *very* prototypical. > The version I would want > > to > > upstream would be a new implementation. > Here's what I'd like to > > address > > and accomplish. > > > > 1. Ask "what does the LLVM community want > from an outliner" and use > > that > > to drive development of the algorithm. > > 2. Reimplement the outliner, perhaps using a > less memory-intensve > > data > > structure like a suffix array. > > 3. Begin adding features to the algorithm, > for example: > > i. Teaching the algorithm about > hot/cold blocks of code and > > taking > > that into account. > > ii. Simple parameter passing. > > iii. 
Similar function outlining-- eg, > noticing that two outlining > > candidates are similar and can be merged > into one function with some > > control flow. > > > > > > ===============================> > Code > > ===============================> > Note: This code requires MachineModulePasses > > > > * Main pass: > >https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h > <https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h> > > > > * Suffix tree: > >https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h > <https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h> > > > > * TerminatedString and TerminatedStringList: > >https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h > <https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h> > > > > Here are a couple unit tests for the data > structures. > > > > * Suffix tree unit tests: > >https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp > <https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp> > > > > * TerminatedString unit tests: > >https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp > <https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp> > > > > * TerminatedStringList unit tests: > >https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp > <https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp> > > > > > > ===============================> > References > > ===============================> > [1] Ukkonen's Algorithm: > >https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf > <https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf> > > [2] Suffix Trees and Suffix Arrays: > >http://web.cs.iastate.edu/~cs548/suffix.pdf > <http://web.cs.iastate.edu/%7Ecs548/suffix.pdf> > > [3] Extended Application of Suffix Trees to > Data Compression: 
> > http://www.larsson.dogma.net/dccpaper.pdf
> >
> > Thanks for reading,
> > Jessica
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Andrey Bokhanko via llvm-dev
2016-Aug-29 19:35 UTC
[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Daniel,

I wonder what NewGVN would generate for the following C code:

int a, b;
int foo() { return a + b; }
int bar() { return a + b; }

Obviously, the expressions would be the same ("value1 + value2"), but a
single operator is not worth outlining. What congruence classes would be
assigned to the operands? The same class for both reads of "a" and both
reads of "b"? If yes, that would be incorrect: these two additions are
not guaranteed to be semantically equivalent (a's and b's values can be
changed in between).

Also note that one should be able to compare VN's results from different
functions. It looks like an interprocedural VN is required.

Yours,
Andrey

On Sat, Aug 27, 2016 at 9:28 AM, Daniel Berlin via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> On Fri, Aug 26, 2016 at 9:44 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>> *From: *"Daniel Berlin" <dberlin at dberlin.org>
>> *To: *"Quentin Colombet" <qcolombet at apple.com>
>> *Cc: *"Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <llvm-dev at lists.llvm.org>
>> *Sent: *Friday, August 26, 2016 11:06:56 PM
>> *Subject: *Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
>>
>> FWIW: I'm with Quentin. Without a good value numbering analysis, this
>> is a hard problem.
>>
>> How exactly does value numbering help here? This problem seems to be
>> about finding structurally-similar parts of the computations of
>> different values, not identical values.
>>
> Note, first, that the optimization we are talking about is not
> outlining, at least as far as I've ever heard it.
>
> Function outlining, etc., is about removing cold regions of functions
> to separate functions to make the hot portions smaller, more inlinable,
> etc.
>
> What is being talked about here is a variant of identical code folding,
> e.g., http://research.google.com/pubs/pub36912.html and its variants,
> where "identical" varies (and is based on semantic equivalence).
> Variants where you extract portions partially to merge functions are
> pretty much the same optimization :)
>
> Here, value numbering helps because it gives you value expressions.
>
> It tells you: this operation is "add V1, V2". It does not matter what
> V1 and V2 are; they are just abstract things, and we know whether these
> abstract things are equivalent or not.
>
> More relevantly, these expressions form a value number DAG (VNDAG).
> That is, they represent the computations being computed only in terms
> of operators, abstract variables, and other parts of the VNDAG.
>
> The *actual values* of those computations (numbers, etc.) do not
> matter; each abstract value number has some set of things *in the
> program* that are equivalent to it. The question is "are these
> semantically the same operations in the same order", regardless of the
> values of those operations.
>
> See, for example, the DAGs generated by:
> https://www.researchgate.net/publication/221323142_An_Efficient_SSA-Based_Algorithm_for_Complete_Global_Value_Numbering
>
> Again, this is not something the current GVN does, but NewGVN (and
> other GVNs) can give you.
>
> The question of which portions of a function you can merge is a variant
> of the common subgraph and graph isomorphism problems on a VNDAG
> (because, in the end, it doesn't matter where the computations are; it
> matters what they abstractly compute and what they depend on).
>
>> It seems much closer to the analysis necessary for SLP vectorization
>> than value numbering, as such.
>>
>> I think we might be able to use value numbering to help with a subset
>> of SLP vectorization, but only because we can define some kind of
>> equivalence relation on consecutive memory accesses (and similar).
>> What you need for this outlining seems more akin to the general
>> SLP-vectorization case. This might just be the maximal common subgraph
>> problem.
> If I have an add in a loop, and it's VN-equivalent to an add outside a
> loop in some other function, I can still merge the functions :)
>
> It's how the value is generated and used that matters, not the
> structure of the program. So you can't just do it on a regular CFG or
> anything like that if you want good results.
>
>> -Hal
>>
>> GVN as it exists now doesn't really provide what you want, and in
>> fact, doesn't value number a lot of instruction types (including, for
>> example, loads, stores, calls, phis, etc.). It also relies on
>> ordering, and more importantly, it relies on *not* being a strong
>> analysis. If you were to stop it from eliminating as it went, it would
>> catch maybe 10% of the redundancies it does now. So what you want is
>> "I want to know what globally is the same", and it doesn't really
>> answer that well.
>>
>> Doing the analysis Quentin wants is pretty trivial in NewGVN (analysis
>> and elimination are split, everything is broken into congruence
>> classes explicitly, each congruence class has an id, you could sub in
>> that id as the value for the terminated string), but I'd agree that
>> GVN as it exists now will not do what they want, and would be pretty
>> hard to make work well.
>>
>> (FWIW: Davide Italiano has started breaking up NewGVN into submittable
>> pieces, and is pretty far along to having a first patch set, I
>> believe.)
>>
>> On Fri, Aug 26, 2016 at 4:54 PM, Quentin Colombet via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Hi,
>>>
>>> I'll let Jessica give more details, but here are some insights.
>>>
>>> MIR offers a straightforward way to model benefits, because we know
>>> which instructions we remove and which ones we add, and there is no
>>> overhead of setting up parameters. Indeed, since the coloring will be
>>> the same between the different outlining candidates, the call is just
>>> a jump somewhere else. We do not have to worry about the impact of
>>> parameter passing and ABI.
>>> So basically: a better cost model. That's one part of the story.
>>>
>>> The other part is that, at the LLVM IR level or before register
>>> allocation, identifying similar code sequences is much harder, at
>>> least with a suffix-tree-like algorithm. Basically, the problem is
>>> how we name our instructions such that we can match them.
>>> Let me take an example.
>>> foo() {
>>> /* bunch of code */
>>> a = add b, c;
>>> d = add e, f;
>>> }
>>>
>>> bar() {
>>> d = add e, g;
>>> f = add c, w;
>>> }
>>>
>>> With proper renaming, we can outline both adds in one function. The
>>> difficulty is to recognize that they are semantically equivalent so
>>> we can give them the same identifier in the suffix tree. I won't get
>>> into the details, but it gets tricky quickly. We were thinking of
>>> reusing GVN to have such an identifier if we wanted to do the
>>> outlining at the IR level, but solving this problem is hard.
>>>
>>> By running after regalloc, we basically have a heuristic that does
>>> this naming for us.
>>>
>>> Cheers,
>>> -Quentin
>>>
>>> On Aug 26, 2016, at 3:01 PM, Hal Finkel via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>> *From: *"Kevin Choi" <code.kchoi at gmail.com>
>>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>>> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>
>>> *Sent: *Friday, August 26, 2016 4:55:29 PM
>>> *Subject: *Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
>>>
>>> I think the "Motivation" section explained that.
>>>
>>> I don't think it explained it.
>>>
>>> I too first thought "why not at IR?", but the reason looks to be that
>>> MIR, post-RA, has the most accurate heuristics (the best way to know
>>> looks like actually getting there).
>>>
>>> But also, potentially, the fewest opportunities. That's why I'm
>>> curious about the motivation -- the trade-offs are not obvious to me.
>>> -Hal
>>>
>>> Do you know if there is any experimental pass that relies on deriving
>>> heuristics by a feedback loop after letting, e.g., a duplicate
>>> module/function/block continue past?
>>>
>>> Regards,
>>> Kevin
>>>
>>> On 26 August 2016 at 14:36, Hal Finkel via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Hi Jessica,
>>>>
>>>> This is quite interesting.
>>>>
>>>> Can you comment on why you started by doing this at the MI level, as
>>>> opposed to the IR level? And at the MI level, why after RA instead
>>>> of before RA?
>>>>
>>>> Perhaps the first question is another way of asking how accurately
>>>> we can model call-site code size at the IR level?
>>>>
>>>> Thanks in advance,
>>>> Hal
>>>>
>>>> > From: "Jessica Paquette via llvm-dev" <llvm-dev at lists.llvm.org>
>>>> > To: llvm-dev at lists.llvm.org
>>>> > Sent: Friday, August 26, 2016 4:26:09 PM
>>>> > Subject: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
>>>> >
>>>> > Hi everyone,
>>>> >
>>>> > Since I haven't said anything on the mailing list before, a quick
>>>> > introduction. I'm an intern at Apple, and over the summer I
>>>> > implemented a prototype for an outlining pass for code size in
>>>> > LLVM. Now I'm seeking to eventually upstream it. I have the
>>>> > preliminary code on GitHub right now, but it's still very
>>>> > prototypical (see the code section).
>>>> >
>>>> > ===============================
>>>> > Motivation
>>>> > ===============================
>>>> > The goal of the internship was to create an interprocedural pass
>>>> > that would reduce code size as much as possible, perhaps at the
>>>> > cost of some performance. This would be useful to, say, embedded
>>>> > programmers who only have a few kilobytes to work with and a
>>>> > substantial amount of code to fit in that space.
>>>> >
>>>> > ===============================
>>>> > Approach and Initial Results
>>>> > ===============================
>>>> > To do this, we chose to implement an outliner. Outliners find
>>>> > sequences of instructions which would be better off as a function
>>>> > call, by some measure of "better". In this case, the measure of
>>>> > "better" is "makes code smaller".
>>>> >
>>>> > ===============================
>>>> > Results
>>>> > ===============================
>>>> > These results are from a fairly recent 64-bit Intel processor,
>>>> > using a version of Clang equipped with the outliner prototype
>>>> > versus an equivalent version of Clang without the outliner.
>>>> >
>>>> > CODE SIZE REDUCTION
>>>> > For tests >= 4 KB in non-outlined size, the outliner currently
>>>> > provides an average of 12.94% code size reduction on the LLVM test
>>>> > suite in comparison to a default Clang, up to 51.4% code size
>>>> > reduction. In comparison to a Clang with -Oz, the outliner
>>>> > provides an average of a 1.29% code size reduction, up to a 37%
>>>> > code size reduction. I believe that the -Oz numbers can be further
>>>> > improved by better tuning the outlining cost model.
>>>> >
>>>> > EXECUTION TIME IMPACT
>>>> > On average, the outliner increases execution time by 2% on the
>>>> > LLVM test suite, but has also been shown to improve execution time
>>>> > by up to 16%. These results were from a fairly recent Intel
>>>> > processor, so the results may vary. Recent Intel processors have
>>>> > very low latency for function calls, which may impact these
>>>> > results. Execution time improvements are likely dependent on the
>>>> > latency of function calls, instruction caching behaviour, and the
>>>> > execution frequency of the code being outlined. In particular,
>>>> > using a processor with heavy function call latency will likely
>>>> > increase execution time overhead.
>>>> >
>>>> > ===============================
>>>> > Implementation
>>>> > ===============================
>>>> > The outliner, in its current state, is a MachineModulePass. It
>>>> > finds *identical* sequences of MIR, after register allocation, and
>>>> > pulls them out into their own functions. Thus, it's effectively
>>>> > assembly-level. Ultimately, the algorithm used is general, so it
>>>> > can sit anywhere, but MIR was very convenient for the time being.
>>>> >
>>>> > It requires two data structures.
>>>> >
>>>> > 1. A generalized suffix tree
>>>> > 2. A "terminated string"
>>>> >
>>>> > 1: The generalized suffix tree is constructed using Ukkonen's
>>>> > linear-time construction algorithm [1]. Suffix trees require
>>>> > linear space and support linear-time substring queries. In
>>>> > practice, the construction of the suffix tree is the most
>>>> > time-consuming part, but I haven't noticed a difference in
>>>> > compilation time on, say, 12 MB .ll files.
>>>> >
>>>> > 2: To support the suffix tree, we need a "terminated string". This
>>>> > is a generalized string with a unique terminator character
>>>> > appended to the end. TerminatedStrings can be built from any type.
>>>> >
>>>> > The algorithm is then roughly as follows.
>>>> >
>>>> > 1. For each MachineBasicBlock in the program, build a
>>>> > TerminatedString for that block.
>>>> > 2. Build a suffix tree for that collection of strings.
>>>> > 3. Query the suffix tree for the longest repeated substring and
>>>> > place that string in a candidate list. Repeat until none are
>>>> > found.
>>>> > 4. Create functions for each candidate.
>>>> > 5. Replace each candidate with a call to its respective function.
>>>> >
>>>> > Currently, the program itself isn't stored in the suffix tree, but
>>>> > rather a "proxy string" of integers. This isn't necessary at the
>>>> > MIR level, but may be for an IR-level extension of the algorithm.
>>>> >
>>>> > ===============================
>>>> > Challenges
>>>> > ===============================
>>>> > 1) MEMORY CONSUMPTION
>>>> > Given a string of length n, a naive suffix tree implementation can
>>>> > take up to 40n bytes of memory. However, this number can be
>>>> > reduced to 20n with a bit of work [2]. Currently, the suffix tree
>>>> > stores the entire program, including instructions which really
>>>> > ought not to be outlined, such as returns. These instructions
>>>> > should not be included in the final implementation, but should
>>>> > rather act as terminators for the strings. This will likely curb
>>>> > memory consumption. Suffix trees have been used in the past in
>>>> > sliding-window-based compression schemes, which may serve as a
>>>> > source of inspiration for reducing memory overhead [3].
>>>> >
>>>> > Nonetheless, the outliner probably shouldn't be run unless it
>>>> > really has to be. It will likely be used mostly in embedded
>>>> > spaces, where the programs have to fit into small devices anyway,
>>>> > so memory overhead for the compiler likely won't be a problem. The
>>>> > outliner should only be used in -Oz compilations, and possibly
>>>> > should have its own flag.
>>>> >
>>>> > 2) EXECUTION TIME
>>>> > Currently, the outliner isn't tuned for preventing execution time
>>>> > increases. There is an average of a 2% execution time hit on the
>>>> > tests in the LLVM test suite, with a few outliers of up to 30%.
>>>> > The outliers are tests which contain hot loops. The outliner
>>>> > really ought to be able to use profiling information and not
>>>> > outline from hot areas. Another suggestion people have given me is
>>>> > to add a "never outline" directive which would allow the user to
>>>> > say something along the lines of "this is a hot loop, please never
>>>> > outline from it".
>>>> >
>>>> > It's also important to note that these numbers are coming from a
>>>> > fairly recent Intel processor.
>>>> >
>>>> > 3) CONSTRAINTS ON INSTRUCTIONS
>>>> > The outliner currently won't pull anything out of functions which
>>>> > use a red zone. It also won't pull anything out that uses the
>>>> > stack or the instruction pointer, or that uses constant pool
>>>> > indices, CFI indices, jump table indices, or frame indices. This
>>>> > removes many opportunities for outlining which would likely be
>>>> > available at a higher level (such as IR). Thus, there's a case for
>>>> > moving this up to a higher level.
>>>> >
>>>> > ===============================
>>>> > Additional Applications
>>>> > ===============================
>>>> > The suffix tree itself could be used as a tool for finding
>>>> > opportunities to refactor code. For example, it could recognize
>>>> > places where the user likely copied and pasted some code. This
>>>> > could be run on codebases to find areas where people could
>>>> > manually outline things at the source level.
>>>> >
>>>> > Using the terminated string class, it would also be possible to
>>>> > implement other string algorithms on code. This may open the door
>>>> > to new ways to analyze existing codebases.
>>>> >
>>>> > ===============================
>>>> > Roadmap
>>>> > ===============================
>>>> > The current outliner is *very* prototypical. The version I would
>>>> > want to upstream would be a new implementation. Here's what I'd
>>>> > like to address and accomplish.
>>>> >
>>>> > 1. Ask "what does the LLVM community want from an outliner?" and
>>>> > use that to drive development of the algorithm.
>>>> > 2. Reimplement the outliner, perhaps using a less memory-intensive
>>>> > data structure, like a suffix array.
>>>> > 3. Begin adding features to the algorithm, for example:
>>>> > i.
Teaching the algorithm about hot/cold blocks of code and taking
>>>> > that into account.
>>>> > ii. Simple parameter passing.
>>>> > iii. Similar function outlining -- e.g., noticing that two
>>>> > outlining candidates are similar and can be merged into one
>>>> > function with some control flow.
>>>> >
>>>> > ===============================
>>>> > Code
>>>> > ===============================
>>>> > Note: This code requires MachineModulePasses
>>>> >
>>>> > * Main pass:
>>>> > https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
>>>> >
>>>> > * Suffix tree:
>>>> > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
>>>> >
>>>> > * TerminatedString and TerminatedStringList:
>>>> > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h
>>>> >
>>>> > Here are a couple of unit tests for the data structures.
>>>> >
>>>> > * Suffix tree unit tests:
>>>> > https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp
>>>> >
>>>> > * TerminatedString unit tests:
>>>> > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp
>>>> >
>>>> > * TerminatedStringList unit tests:
>>>> > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp
>>>> >
>>>> > ===============================
>>>> > References
>>>> > ===============================
>>>> > [1] Ukkonen's Algorithm:
>>>> > https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
>>>> > [2] Suffix Trees and Suffix Arrays:
>>>> > http://web.cs.iastate.edu/~cs548/suffix.pdf
>>>> > [3] Extended Application of Suffix Trees to Data Compression:
>>>> > http://www.larsson.dogma.net/dccpaper.pdf
>>>> >
>>>> > Thanks for reading,
>>>> > Jessica
>>>> >
>>>> > _______________________________________________
>>>> > LLVM Developers mailing list
>>>> > llvm-dev at lists.llvm.org
>>>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev