Mircea Trofin via llvm-dev
2020-Apr-08 21:23 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
It turns out it's me, sorry. Let me see how I can sort this out. In the
meantime, here is the csv:

SPEC2006 data:

binary,base -Oz size,ML -Oz size,ML size shrink by,,perf: base -Oz scores,perf: ML -Oz scores,ML improvement by
400.perlbench,2054200,2086776,-1.59%,,2.9,2.9,0.00%
401.bzip2,1129976,1095544,3.05%,,6.4,6.2,-3.13%
403.gcc,4078488,4130840,-1.28%,,11.6,11.7,0.86%
429.mcf,1089368,1054552,3.20%,,2.3,2.4,4.35%
433.milc,1161336,1129592,2.73%,,0.5,0.5,0.00%
444.namd,1324824,1290968,2.56%,,4.4,4.3,-2.27%
445.gobmk,5003096,4992472,0.21%,,3.9,4.1,5.13%
447.dealII,1003376,975024,2.83%,,162.1,197.8,22.02%
450.soplex,1359416,1326008,2.46%,,0.1,0.1,0.00%
453.povray,1921432,1952280,-1.61%,,32.9,32.7,-0.61%
456.hmmer,1210744,1184632,2.16%,,1.5,1.5,0.00%
458.sjeng,1185976,1155320,2.58%,,812.9,840.2,3.36%
462.libquantum,1101144,1066712,3.13%,,12.3,12.2,-0.81%
464.h264ref,1557176,1593272,-2.32%,,0.3,0.3,0.00%
470.lbm,1091352,1056664,3.18%,,2.3,2.5,8.70%
471.omnetpp,1045496,1046664,-0.11%,,27.9,28.4,1.79%
473.astar,1106680,1071992,3.13%,,2.4,2.6,8.33%
482.sphinx3,1216600,1182680,2.79%,,16.5,16.3,-1.21%
483.xalancbmk,4666936,4669112,-0.05%,,9.7,9.6,-1.03%
,,,,,,,
SIZE SUM,34307616,34061104,0.72%,TOTAL SCORES,5.7,5.8,1.75%

various benchmarks:

binary,base -Oz size,ML -Oz size,ML shrink over base
llvm:opt,66318624,62124256,6.32%
math_1,6181552,5884144,4.81%
event_management_1,4998728,4802696,3.92%
llvm:lcalsBRaw,1270168,1222168,3.78%
llvm:lcalsBLambda,1270232,1222232,3.78%
llvm:lcalsARaw,1276248,1228248,3.76%
llvm:lcalsALambda,1276440,1228504,3.76%
llvm:lcalsCRaw,1278936,1231000,3.75%
llvm:lcalsCLambda,1279064,1231256,3.74%
llvm:Blur,1236888,1191192,3.69%
image_processing_diffusion,1236120,1190488,3.69%
llvm:Interpolation,1237208,1191576,3.69%
llvm:harris,1237208,1191768,3.67%
llvm:Dither,1238232,1192792,3.67%
event_management_2,4741096,4568584,3.64%
hashing_1,5189360,5004176,3.57%
compression_2,5329392,5140496,3.54%
math_2,6072336,5858992,3.51%
crypto_1,4721576,4556008,3.51%
math_4,27720208,26755568,3.48%
math_3,27732496,26767920,3.48%
infra_1,30673232,29630464,3.40%
hashing_2,5834544,5637168,3.38%
image_processing_entropy,5854960,5657008,3.38%
hashing_3,5831088,5634288,3.38%
memory_management_1,4957960,4790632,3.37%
hasing_4,5816048,5619888,3.37%
data_storage_2,5852976,5655792,3.37%
compression_1,5971984,5771120,3.36%
arena_unittest,6672944,6453584,3.29%
data_storage_3,44113232,42710960,3.18%
data_storage_1,22858992,22132960,3.18%
datastructures_1,6362032,6161264,3.16%
infra_2,56828720,55053496,3.12%
datastructures_2,164558704,160790320,2.29%
llvm:clang,77027416,75672088,1.76%
benchmark_infra,20221424,19944432,1.37%
search_1,169475360,167645632,1.08%
protobuf_3,24071504,23895440,0.73%
protobuf_1,24071504,23895504,0.73%
protobuf_2,24071504,23895504,0.73%
protobuf_4,24071504,23895504,0.73%
protobuf_5,24071504,23895504,0.73%
protobuf_6,23982800,23809424,0.72%
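
For reference, the percentage columns appear to be relative deltas against
the base -Oz numbers (size shrink relative to the base size, score change
relative to the base score). A small Python sketch of that computation - the
file name and helper below are mine, not part of the csv:

  # Recompute the percentage columns of the SPEC2006 csv above.
  import csv

  def pct(delta, base):
      return 100.0 * delta / base

  with open("spec2006_oz.csv") as f:  # hypothetical file holding the csv rows
      for row in csv.DictReader(f):
          if not row["binary"]:
              continue  # skip the empty separator row
          shrink = pct(float(row["base -Oz size"]) - float(row["ML -Oz size"]),
                       float(row["base -Oz size"]))
          change = pct(float(row["perf: ML -Oz scores"]) - float(row["perf: base -Oz scores"]),
                       float(row["perf: base -Oz scores"]))
          print(f'{row["binary"]}: size shrink {shrink:.2f}%, score change {change:+.2f}%')
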
On Wed, Apr 8, 2020 at 2:11 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:

> Cool!
>
> I skipped to the end and tried to access the gdoc and the spreadsheet
> but it did tell me I need permissions. Can you make them accessible or
> am I the problem?
>
> Thanks,
> Johannes
>
> On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote:
> > TL;DR: We can improve compiler optimizations driven by heuristics by
> > replacing those heuristics with machine-learned policies (ML models).
> > Policies are trained offline and ship as part of the compiler. Determinism
> > is maintained because models are fixed when the compiler is operating in
> > production. Fine-tuning or regressions may be handled by incorporating the
> > interesting cases in the ML training set, retraining the compiler, and
> > redeploying it.
> >
> > For a first milestone, we chose inlining for size (-Oz) on X86-64. We were
> > able to train an ML model to produce binaries 1.5-6% smaller than -Oz of
> > tip-of-tree. The trained model appears to generalize well over a diverse
> > set of binaries. Compile time is increased marginally (under 5%). The model
> > also happens to produce slightly better-performing code under SPEC2006 -
> > total score improvement of 1.75%. As we only wanted to verify there is no
> > significant regression in SPEC, and given the milestone goals, we haven't
> > dug any deeper into the speed results.
> >
> > We see these results as promising, and as a reasonable point for
> > contributing our current work as a build-time opt-in to LLVM to benefit the
> > community, in the hope of fostering collaboration and learning from the
> > community's feedback, as we try to better understand the trade-offs such an
> > approach entails, and as we work on expanding the depth and breadth of
> > applying these techniques to compiler optimization problems.
> >
> > https://reviews.llvm.org/D77752
> >
> > Motivation
> >
> > Optimization problems, such as inlining or register allocation, are hard:
> > we don't know of algorithms that are guaranteed to produce the optimum
> > solution in all cases in a timely fashion. Instead, we use heuristics:
> > algorithms that we expect to produce some approximation of the optimum,
> > with some expected degree of generality, in a practical amount of time.
> >
> > Heuristics have some common characteristics. Taking inlining as a case
> > study, it traverses the problem space in some way (bottom-up traversal of
> > the SCC graph), extracts some properties (let's call them "features") of
> > the program being optimized, and combines them with some weights ("tune")
> > to produce a cost (InlineCost), which allows for trade-off analysis. We
> > validate the effectiveness of a heuristic over some accepted set of
> > benchmarks. Over time, we react to regressions or pathological cases
> > observed in the field by manually analyzing such cases, figuring out an
> > enhancement to the heuristic, and then re-validating over that set of
> > benchmarks (maybe augmented by adding the newly found cases).
> >
> > Because heuristics are code that needs to be maintained, there is pressure
> > to reduce complexity: adding more features means we need to reason about
> > the interactions between the old and new features, which scales
> > combinatorially. Re-tuning because of the new features adds a similar kind
> > of complexity. Potentially, we miss out on optimization improvements as a
> > result.
> >
> > Because tuning is manual, there is pressure to keep the number of
> > benchmarks that can be studied in depth to a humanly-manageable size, which
> > potentially affects the generality of a heuristic or heuristic tuning.
> >
> > The main advantage of manual heuristics is arguably their relatively lower
> > overhead: no additional dependencies, and they are more transparent to
> > human analysis and improvement.
> >
> > Machine learning, in particular reinforcement learning, can address the
> > tensions found in manual heuristics: once features are extracted from the
> > program, the way they are combined and tuned can easily be scaled up
> > through automation, improving effectiveness and generality. A major
> > drawback of machine learning, at least at this point in time, is that we
> > don't yet have a fully developed systematic approach for improving policy
> > effectiveness.
> >
> > High level design
> >
> > We identify two scenarios for a compiler using ML policies: development and
> > release.
> >
> > The release scenario is equivalent to the regular compilation we have today
> > - the only difference is that it uses a pre-trained model (trained in the
> > development scenario beforehand) to make decisions instead of the
> > heuristics. Determinism is guaranteed since the model in the release
> > scenario is fixed. We imagine teams wishing to fine-tune the effectiveness
> > of the optimization to their scenarios would train a different model.
> >
> > The decision previously evaluated using a human-crafted heuristic is
> > optionally replaced by:
> >
> >   - a compiler-specific component, extracting features from IR (i.e. a
> >     vector of values)
> >   - an evaluation of an ML model using those features, to obtain a result.
> >     In ML nomenclature, this is referred to as using the model for
> >     inference (as opposed to training it)
> >
> > For example, when we replaced the decision of whether to inline a callsite,
> > the ML model produces a boolean (inline/don't inline) based on a features
> > vector characterizing the call site and some broader module-wide context.
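
To make the shape of this replacement concrete, here is a minimal sketch (in
Python, for brevity; the real implementation is C++, and the feature names,
weights and model below are made up for illustration):

  # A hand-written heuristic decision vs. the same decision delegated to a
  # trained model. Everything here is illustrative, not the actual pass.
  from typing import Dict

  def extract_features(call_site) -> Dict[str, float]:
      # In the compiler this is computed from IR; stubbed out here.
      return {
          "callee_instruction_count": 42.0,
          "callsite_block_frequency": 3.0,
          "module_function_count": 1200.0,
      }

  def heuristic_should_inline(features: Dict[str, float]) -> bool:
      # The manual approach: hand-picked weights combined into a cost,
      # compared against a hand-picked threshold (InlineCost-style).
      cost = (2.0 * features["callee_instruction_count"]
              - 5.0 * features["callsite_block_frequency"])
      return cost < 100.0

  def ml_should_inline(model, features: Dict[str, float]) -> bool:
      # The ML approach: a trained policy maps the feature vector to a
      # boolean; the "weights" live inside the model, not in compiler code.
      return bool(model(list(features.values())))
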
> > Training/development is more complicated, and happens offline - akin to
> > how, today, attempts to improve an optimizing pass also happen offline. A
> > description of the high level design and the specifics we used for the
> > current scope is given in the Appendix.
> >
> > Current Scope
> >
> > The goal of our first milestone was to evaluate end to end an integration
> > of ML with LLVM, and get a first promising result. To that end, we chose
> > inlining for size (-Oz) as a stepping stone, as we perceived it to be more
> > likely to require a simpler evaluation setup than performance-oriented
> > optimizations might. At this point, we only train whether a call site may
> > be inlined or not, leaving the SCC traversal order as-is.
> >
> > We are proposing an initial change demonstrating the inference mechanism
> > using a pre-trained model, as a build-time opt-in to llvm. The compiler
> > components needed to perform training are also included in this first
> > change. Subsequent changes would include more training-related components.
> >
> > At a high level, the changes we are proposing consist of:
> >
> >   1. a new module analysis pass, InliningAdvisor. By default, its
> >      implementation does nothing.
> >   2. minimal hooks into Inliner.cpp.
> >   3. the implementation of InliningAdvisor, activated when we opt into ML.
> >      This is available in Analysis/ML, together with:
> >      1. Rel/Dev specific ML model handling, also under Analysis/ML
> >      2. a pre-trained model for inlining for size
> >         (Analysis/ML/models/inlining)
> >      3. a pre-trained model for predicting native size from IR
> >         (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
> >   4. some refactorings in PassBuilder, to allow opting into running
> >      mandatory inlining first - some compilation speedup for the ML case,
> >      minimal, noise-like size effect. This also simplifies testing (these
> >      would be introduced as a preliminary patch).
> >
> > We discuss 'rel' mode here, and 'dev' mode in the Appendix, as it is more
> > involved.
> >
> > Inference Opt-In Mechanism
> >
> > The feature is primarily controlled by the cmake flag
> > LLVM_USE_ML_POLICY={"Rel"|"Dev"}. Each has different dependencies. The
> > "Rel"ease case requires specifying the location of the pip tensorflow
> > package (currently, that's tf_nightly, and it should soon be available in
> > tensorflow).
> >
> > To opt in the 'Rel' case:
> >
> >   1. install the tensorflow pip package:
> >
> >      pip3 install tf_nightly --user
> >
> >   2. configure the llvm build:
> >
> >      cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
> >        -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
> >        {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed -
> >        it's usually in the path>}
> >
> >   3. build llvm as usual.
> >
> >   4. pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to
> >      clang.
> >
> > Details
> >
> > The ML model is captured as a TensorFlow 'saved model'. When building llvm,
> > we use TensorFlow's XLA native compiler (saved_model_cli) to compile the
> > saved model into a native static library and a header file. Insofar as LLVM
> > is concerned, there are minimal additional runtime requirements, packaged
> > as source within the pip package: C++ wrappers around the compiled model.
> > These will also be statically linked in the LLVM target. The compiled code
> > is otherwise just a complex arithmetical computation, with no special
> > requirements - it is single threaded and runs natively on the targeted
> > architecture. Together with the aforementioned runtime dependencies, it
> > adds ~115KB to the clang binary (a 0.08% increase).
> >
> > Runtime-wise, we observed a ~10% increase in the time spent in the inliner,
> > for a large (33MB) binary IR module; inlining typically consumes ~10-15% of
> > total compilation time, so the overall compile time overhead of the
> > approach is arguably negligible. This cost is almost entirely attributable
> > to feature extraction.
> >
> > Memory-wise, the precompiled model has a fixed size buffer for its inputs,
> > and performs a fixed amount of computation, so the memory overhead inherent
> > to our approach is independent of the program being optimized. Using a
> > small example to avoid effects such as memory use differences due to
> > different inlinings, we observed a 300KB increase in the maximum resident
> > size.
> >
> > A table showing the effect on -Oz compiled binaries' size is given in the
> > Appendix.
> >
> > Next directions
> >
> > Our next milestone has two main high level goals: detailing a systematic
> > approach to driving policy effectiveness; and exploring in depth the type
> > of features and training algorithms most appropriate for compiler problems,
> > or at least problems like inlining. For the latter, we expect embedding
> > more of the call graph structure to play an important role, as well as,
> > potentially, delegating the problem space traversal to the ML model.
> >
> > We plan to include inlining for speed as part of our work on these goals.
> >
> > Appendix
> >
> > Training - High Level
> >
> > We use Reinforcement Learning (RL) to train the inline-for-size model. At a
> > high level, it is composed of 3 parts: training data collection, model
> > training, and iterative data collection/model training. We use TensorFlow
> > as our ML framework.
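
For readers new to the RL setup, the iteration described in the rest of this
appendix can be summarized roughly as follows (the function names are
placeholders, not APIs from the patch):

  # Rough outline of the iterative training process described below.
  def train_inlining_policy(ir_corpus, iterations):
      policy = bootstrap_from_current_heuristic()
      for _ in range(iterations):
          # 1) Data collection: run opt over the corpus with the current
          #    policy, logging (features, action, reward) per inline decision.
          logs = [collect_training_log(module, policy) for module in ir_corpus]
          # 2) Model training: update the DQN policy from the new logs.
          policy = train_dqn(policy, logs)
      return policy  # exported as a TensorFlow SavedModel for 'Rel' use
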
> > Relatedly, we also needed to learn a separate model to evaluate the native
> > size of a function, given its IR, in order to calculate a more precise
> > reward for the reinforcement learning algorithm ("IR2Native"). We evaluated
> > 'just counting IR' and TargetTransformInfo, but they appeared to provide
> > too noisy a signal for the reward, insofar as the RL training algorithm for
> > the inlining model was concerned. This model is only used during training.
> >
> > RL - Training data collection: the training data we need to feed into a
> > reinforcement learning algorithm are sequences consisting of: the state of
> > the problem (i.e. features); the action (inline/not inline); and the reward
> > (native size shrinkage after inline/not inline, using ir2native). To
> > collect the sequences, we hook logging infrastructure into the LLVM
> > inliner, which is then able to produce logs after the inline optimization
> > pass.
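
Concretely, each logged decision can be thought of as one (state, action,
reward) record along these lines (field names are illustrative; the actual
serialized log format in the patch differs):

  # Sketch of one training example produced by the logging described above.
  from dataclasses import dataclass
  from typing import List

  @dataclass
  class InlineDecisionRecord:
      features: List[float]  # state: the feature vector at the call site
      action: int            # 1 = inlined, 0 = not inlined
      reward: float          # estimated native size shrinkage (via ir2native)

  def reward(size_estimate_before: float, size_estimate_after: float) -> float:
      # Positive when the decision shrank the estimated native size.
      return size_estimate_before - size_estimate_after
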
> > RL - Model training: We use DQN (Deep Q-Network) to train our
> > inlining-for-size ML policy. At a high level, the DQN algorithm trains a
> > neural network to predict the value of different actions --- the DQN policy
> > then chooses to take the action with the highest predicted value. In our
> > scenario, we have two actions: 1) inline; 2) not inline. The neural network
> > predicts the size reduction of each action from the features, and we inline
> > if it predicts that inlining leads to the higher size reduction. After
> > training finishes, the process produces a TensorFlow SavedModel that takes
> > features as input and generates inline decisions (whether to inline or not)
> > as output.
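
The action-selection half of DQN is simple to state for our two actions; a
sketch (the network and its training loop are elided, and q_network stands in
for the trained network):

  # DQN action selection for the two inlining actions.
  import numpy as np

  ACTIONS = ("not inline", "inline")

  def choose_action(q_network, features: np.ndarray) -> str:
      # q_network returns one predicted value (expected size reduction)
      # per action; the policy picks the action with the highest value.
      q_values = q_network(features)  # shape: (2,)
      return ACTIONS[int(np.argmax(q_values))]
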
> > The choice of the features and reward is essential in model training. The
> > features are chosen for being helpful in making the decision --- the input
> > to the existing cost model is a good starting point. Ideally, the reward in
> > the inline-for-size problem is the native size shrinkage after inlining/not
> > inlining. It is difficult to obtain precisely, so we use an estimate
> > instead. This means that, for training, we also need a model (not
> > necessarily machine learned) for estimating rewards.
> >
> > RL - Iterative data collection/model training: Reinforcement learning is
> > ideally an iterative model/policy improvement process in which: 1) the
> > trained model is deployed to the field to collect new data; 2) the newly
> > collected data are used to update the model. Thus, we need to do iterative
> > data collection/model training. To facilitate data collection (avoiding
> > complex build dependencies and time spent before/after the pass being
> > trained), we isolate the IR modules captured right before the optimization
> > we are interested in, and iterate on them with opt. We bootstrap the
> > training from the current heuristic, and stop the process once we are happy
> > with the outcome.
> >
> > IR2Native: We collect IR features (different from the ones used for
> > inlining) for each function at the end of inlining, right before we perform
> > function simplification passes, and right after. This means we have two IR
> > 'shapes' of the same function, and we know no further inlinings will be
> > performed, so whatever changes happen are based on that IR. We then extract
> > the native size at the end of compilation. Together, this data forms two
> > records per function that can be used in supervised learning - the features
> > are those extracted from IR, and the label is the native size. Training
> > IR2Native happens independently from the training of the inliner model.
> >
> > Training support for the current scope
> >
> > The initial change includes the logging mechanism, an ir2native model
> > trained for x86-64, and the means to rapidly iterate over development ML
> > models. For the components that will be included in subsequent changes, the
> > rest of this section describes the mechanisms we employed.
> >
> > To build LLVM with the ML policy in 'Dev' mode, we need the tensorflow C
> > API library <https://www.tensorflow.org/install/lang_c>. We then configure
> > the build:
> >
> >   cmake .. -DLLVM_USE_ML_POLICY=Dev \
> >     -DLLVM_TF_C_LIB=<path to unarchived package> \
> >     {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy the tensorflow
> >     shared library over, if it's not on LD_LIBRARY_PATH}
> >
> > To extract IR right before inlining, we hacked on top of the ThinLTO
> > infrastructure, interrupting its pre-link pipeline right before inlining.
> > This means clang produces binary IR files captured at that stage. We then
> > built a large target, obtaining a corpus of ~25K modules. We intend to
> > provide a clean mechanism in a subsequent change.
> >
> > To obtain features/labels for training this "IR to Native Size" model, we
> > had to make some changes to the AsmPrinter (to get real native sizes) and
> > llvm-readobj, as well as add an analysis pass for extracting the IR
> > features for this model. We plan on upstreaming these changes subsequently.
> >
> > Finally, the infrastructure driving the policy training is currently built
> > on proprietary APIs, as it benefits from a distributed computing
> > infrastructure. We are currently evaluating options for open sourcing it.
> > In the meantime, we are presenting the high level implementation details.
> >
> > To train a new model, the infrastructure performs 2 steps: extracting the
> > logs, and using them in a training algorithm.
> >
> > Log extraction is highly parallelizable: for each IR module in the training
> > corpus, we need to run opt once (or a few times, when we explore
> > improvements). Specifically, each run is this invocation:
> >
> >   opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to
> >   ir2native> -training-log=<path to training log output> -enable-ml-inliner
> >   -mandatory-inlinings-first -o <output> <module.o>
> >
> > Then we collect the logs, and pass them to the next step.
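
Since modules are processed independently, a small driver is enough to
parallelize this step; for example (paths, pool size and file layout are
illustrative):

  # Run the opt invocation above over every module in the corpus, in parallel.
  import multiprocessing
  import pathlib
  import subprocess

  def extract_log(module: pathlib.Path) -> pathlib.Path:
      log = module.with_suffix(".log")
      subprocess.run([
          "opt", "-passes=scc-oz-module-inliner",
          "-ml-inliner-ir2native-model=ir2native",  # path to the ir2native model
          f"-training-log={log}",
          "-enable-ml-inliner", "-mandatory-inlinings-first",
          "-o", str(module.with_suffix(".out.bc")),
          str(module),
      ], check=True)
      return log

  if __name__ == "__main__":
      corpus = sorted(pathlib.Path("corpus").glob("**/*.o"))
      with multiprocessing.Pool() as pool:
          logs = pool.map(extract_log, corpus)
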
> > Experimental results
> >
> > Experimental results are available as follows:
> >
> >   - SPEC2006
> >     <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
> >     binary sizes (-Oz) and 'test run' scores.
> >   - Size report
> >     <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
> >     from some internal benchmarks and binaries, including opt and clang.

Mircea Trofin via llvm-dev
2020-Apr-08 21:34 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
Unfortunately I needed to update the links. Here they are, hopefully these
work correctly.

SPEC2006
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=987260531&single=true>

Internal benchmarks, clang, opt
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=0&single=true>

Thanks!

Mircea Trofin via llvm-dev
2020-Apr-09 02:05 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
+Yundi Qian <yundi at google.com> +Eugene Brevdo <ebrevdo at google.com>, our
team members from the ML side.

To avoid formatting issues, here is a link to the RFC
<https://docs.google.com/document/d/1BoSGQlmgAh-yUZMn4sCDoWuY6KWed2tV58P4_472mDE/edit?usp=sharing>,
open to comments.

Thanks!
data_storage_1,22858992,22132960,3.18% >> datastructures_1,6362032,6161264,3.16% >> infra_2,56828720,55053496,3.12% >> datastructures_2,164558704,160790320,2.29% >> llvm:clang,77027416,75672088,1.76% >> benchmark_infra,20221424,19944432,1.37% >> search_1,169475360,167645632,1.08% >> protobuf_3,24071504,23895440,0.73% >> protobuf_1,24071504,23895504,0.73% >> protobuf_2,24071504,23895504,0.73% >> protobuf_4,24071504,23895504,0.73% >> protobuf_5,24071504,23895504,0.73% >> protobuf_6,23982800,23809424,0.72% >> >> On Wed, Apr 8, 2020 at 2:11 PM Johannes Doerfert < >> johannesdoerfert at gmail.com> wrote: >> >>> >>> Cool! >>> >>> I skipped to the end and tried to access the gdoc and the spreadsheet >>> but it did tell me I need permissions. Can you make them accessible or >>> am I the problem? >>> >>> Thanks, >>> Johannes >>> >>> On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote: >>> > TL;DR; We can improve compiler optimizations driven by heuristics by >>> > replacing those heuristics with machine-learned policies (ML models). >>> > Policies are trained offline and ship as part of the compiler. >>> Determinism >>> > is maintained because models are fixed when the compiler is operating >>> in >>> > production. Fine-tuning or regressions may be handled by >>> incorporating the >>> > interesting cases in the ML training set, retraining the compiler, and >>> > redeploying it. >>> > >>> > For a first milestone, we chose inlining for size (-Oz) on X86-64. We >>> were >>> > able to train an ML model to produce binaries 1.5-6% smaller than -Oz >>> of >>> > tip-of-tree. The trained model appears to generalize well over a >>> diverse >>> > set of binaries. Compile time is increased marginally (under 5%). The >>> model >>> > also happens to produce slightly better - performing code under >>> SPEC2006 - >>> > total score improvement by 1.75%. As we only wanted to verify there >>> is no >>> > significant regression in SPEC, and given the milestone goals, we >>> haven’t >>> > dug any deeper into the speed results. >>> > >>> > We see these results as promising, and as a reasonable point for >>> > contributing our current work as a build-time opt-in to LLVM to >>> benefit the >>> > community, in the hope of fostering collaboration and learning from >>> the >>> > community’s feedback, as we try to better understand the trade-offs >>> such an >>> > approach entails, and as we work on expanding the depth and breadth of >>> > applying these techniques to compiler optimization problems. >>> > >>> > https://reviews.llvm.org/D77752 >>> > Motivation >>> > >>> > Optimization problems, such as inlining or register allocation, are >>> hard: >>> > we don’t know of algorithms that are guaranteed to produce the optimum >>> > solution in all cases in a timely fashion. Instead, we use heuristics: >>> > algorithms that we expect to produce some approximation of the >>> optimum, >>> > with some expected degree of generality, in a practical amount of >>> time. >>> > >>> > Heuristics have some common characteristics. Taking inlining as a case >>> > study, it traverses the problem space in some way (bottom-up >>> traversal of >>> > the SCC graph), extracts some properties (let’s call them “features”) >>> of >>> > the program being optimized, and combine them with some weights >>> (“tune”), >>> > to produce a cost (InlineCost), which allows for trade-off analysis. >>> We >>> > validate the effectiveness of a heuristic over some accepted set of >>> > benchmarks. 
Over time, we react to regressions or pathological cases >>> > observed in the field, by manually analyzing such cases, figuring out >>> an >>> > enhancement to the heuristic, and then re-validating over that set of >>> > benchmarks (maybe augmented by adding the newly found cases). >>> > >>> > Because heuristics are code that needs to be maintained, there is >>> pressure >>> > to reduce complexity: adding more features means we need to reason >>> about >>> > the interactions between the old and new features, which scales >>> > combinatorially. Re-tuning because of the new features adds a similar >>> kind >>> > of complexity. Potentially, we miss out on optimization improvements >>> as a >>> > result. >>> > >>> > Because tuning is manual, there is pressure to keep the number of >>> > benchmarks that can be studied in depth to a humanly-manageable size, >>> which >>> > potentially affects the generality of a heuristic or heuristic tuning. >>> > >>> > The main advantage of manual heuristics is arguably their relatively >>> lower >>> > overhead: no additional dependencies and more transparent to human >>> analysis >>> > and improvement. >>> > >>> > Machine learning, in particular reinforcement learning, can address >>> the >>> > tensions found in manual heuristics: once features are extracted from >>> the >>> > program, the way they are combined and tuned can easily be scaled up >>> > through automation, improving effectiveness and generality. A major >>> > drawback, at least at this point in time, of machine learning, is >>> that we >>> > don’t yet have a fully developed systematic approach for improving >>> policy >>> > effectiveness. >>> > High level design >>> > >>> > We identify two scenarios for a compiler using ML policies: >>> development and >>> > release. >>> > >>> > The release scenario is equivalent to the regular compilation we have >>> today >>> > - the only difference is that it uses a pre-trained model (trained in >>> the >>> > development scenario beforehand) to make decisions instead of the >>> > heuristics. Determinism is guaranteed since the model in the release >>> > scenario is fixed. We imagine teams wishing to fine tune the >>> effectiveness >>> > of the optimization to their scenarios would train a different model. >>> > >>> > The decision previously evaluated using a human-crafted heuristic is >>> > optionally replaced by: >>> > >>> > - >>> > >>> > a compiler-specific component, extracting features from IR (i.e. a >>> > vector of values) >>> > - >>> > >>> > an evaluation of an ML model using those features, to obtain a >>> result. >>> > In ML nomenclature, this is referred to using the model for >>> inference (as >>> > opposed to training it) >>> > >>> > For example, when we replaced the decision of whether to inline a >>> callsite, >>> > the ML model produces a boolean (inline/don’t inline) based on a >>> features >>> > vector characterizing the call site and some broader module-wide >>> context. >>> > >>> > Training/development is more complicated, and happens offline - akin >>> to >>> > how, today, attempts to improve an optimizing pass also happen >>> offline. A >>> > description of the high level design and the specifics we used for the >>> > current scope are given in Appendix. >>> > Current Scope >>> > >>> > The goal of our first milestone was to evaluate end to end an >>> integration >>> > of ML with LLVM, and get a first promising result. 
To that end, we >>> chose >>> > inlining for size (-Oz) as a stepping stone, as we perceived it to be >>> more >>> > likely to require a simpler evaluation setup than performance-oriented >>> > optimizations might. At this point, we only train whether a call site >>> may >>> > be inlined or not, leaving the SCC traversal order as-is. >>> > >>> > We are proposing an initial change demonstrating the inference >>> mechanism >>> > using a pre-trained model, as a build-time opt-in to llvm. The >>> compiler >>> > components needed to perform training are also included in this first >>> > change. Subsequent changes would include more training-related >>> components. >>> > >>> > At a high level, the changes we are proposing consist of: >>> > >>> > 1. >>> > >>> > a new module analysis pass, InliningAdvisor. By default, its >>> > implementation does nothing. >>> > 2. >>> > >>> > minimal hooks into Inliner.cpp. >>> > 3. >>> > >>> > the implementation of InliningAdvisor, activated when we opt-in >>> ML. This >>> > is available in Analysis/ML, together with: >>> > 1. >>> > >>> > Rel/Dev specific ML model handing, also under Analysis/ML >>> > 2. >>> > >>> > a pre-trained model for inlining for size >>> > (Analysis/ML/models/inlining) >>> > 3. >>> > >>> > a pre-trained model for predicting native size from IR >>> > (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only. >>> > 4. >>> > >>> > Some refactorings in PassBuilder, to allow opting into running >>> mandatory >>> > inlining first - some compilation speedup for the ML case, minimal, >>> > noise-like size effect. Also simplifies testing (these would be >>> introduced >>> > as a preliminary patch). >>> > >>> > We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is >>> more >>> > involved. >>> > Inference Opt-In Mechanism >>> > >>> > The feature is primarily controlled by the cmake flag >>> > LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The >>> > “Rel”ease case requires specifying the location of the pip tensorflow >>> > package (currently, that’s tf_nightly, and it should soon be >>> available in >>> > tensorflow) >>> > >>> > To opt in the ‘Rel’ case: >>> > >>> > 1. >>> > >>> > install tensorflow pip package >>> > >>> > pip3 install tf_nightly --user >>> > >>> > 1. >>> > >>> > configure llvm build >>> > >>> > cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \ >>> > >>> > -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow >>> \ >>> > >>> > {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - >>> it’s >>> > usually in the path>} >>> > >>> > >>> > >>> > 1. >>> > >>> > build llvm as usual. >>> > 2. >>> > >>> > pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to >>> > clang. >>> > >>> > Details >>> > >>> > The ML model is captured as a TensorFlow ‘saved model’. When building >>> llvm, >>> > we use TensorFlow’s XLA native compiler (saved_model_cli) to compile >>> the >>> > saved model into a native static library and a header file. Insofar >>> as LLVM >>> > is concerned, there are minimal additional runtime requirements, >>> packaged >>> > as source within the pip package: C++ wrappers around the compiled >>> model. >>> > These will also be statically linked in the LLVM target. The compiled >>> code >>> > is otherwise just a complex arithmetical computation, with no special >>> > requirements - it is single threaded and runs natively on the targeted >>> > architecture. 
Runtime-wise, we observed a ~10% increase in the time spent in the inliner for a large (33MB) binary IR module; inlining typically consumes ~10-15% of total compilation time, so the overall compile-time overhead of the approach is arguably negligible. This cost is almost entirely attributable to feature extraction.

Memory-wise, the precompiled model has a fixed-size buffer for its inputs and performs a fixed amount of computation, so the memory overhead inherent to our approach is independent of the program being optimized. Using a small example, to avoid effects such as memory-use differences due to different inlinings, we observed a 300KB increase in the maximum resident size.

A table showing the effect on the size of -Oz compiled binaries is given in the Appendix.

Next directions

Our next milestone has two main high-level goals: detailing a systematic approach to driving policy effectiveness, and exploring in depth the types of features and training algorithms most appropriate for compiler problems, or at least for problems like inlining. For the latter, we expect embedding more of the call graph structure to play an important role, as well as, potentially, delegating the problem-space traversal to the ML model.

We plan to include inlining for speed as part of our work on these goals.

Appendix

Training - High Level

We use Reinforcement Learning (RL) to train the inline-for-size model. At a high level, it is composed of 3 parts: training data collection, model training, and iterative data collection/model training. We use TensorFlow as our ML framework.

Relatedly, we also needed to learn a separate model to evaluate the native size of a function, given its IR, in order to calculate a more precise reward for the reinforcement learning algorithm ("IR2Native"). We evaluated 'just counting IR' and TargetTransformInfo, but they appeared to provide too noisy a signal for the reward, insofar as the RL training algorithm for the inlining model was concerned. This model is only used during training.

RL - Training data collection: the training data we need to feed into a reinforcement learning algorithm are sequences consisting of: the state of the problem (i.e. features), the action (inline/not inline), and the reward (native size shrinkage after inline/not inline, using ir2native). To collect the sequences, we hook logging infrastructure into the LLVM Inliner, which is able to produce logs after the inline optimization pass.

RL - Model training: we use DQN (Deep Q-Network) to train our inlining-for-size ML policy. At a high level, the DQN algorithm trains a neural network to predict the value of different actions - the DQN policy then takes the action with the highest predicted value. In our scenario, we have two actions: 1) inline, 2) don't inline. The neural network predicts the size reduction of each action based on the features, and inlines if it predicts that inlining leads to the larger size reduction. After training finishes, it produces a TensorFlow SavedModel that takes features as input and generates inline decisions (whether to inline or not) as output.
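A small sketch of the two pieces just described, with made-up types: the (state, action, reward) record that logging produces, and the DQN-style decision rule of taking the action with the higher predicted value.

  #include <array>
  #include <cstdint>
  #include <vector>

  enum class Action : uint8_t { NoInline = 0, Inline = 1 };

  // One training example: features observed at the call site, the action the
  // policy took, and the reward (estimated native size shrinkage, via the
  // IR2Native model). Field names are illustrative.
  struct LogRecord {
    std::vector<int64_t> Features;
    Action TakenAction;
    double Reward;
  };

  // The trained policy at decision time: QValues[a] is the network's
  // predicted value (size reduction) of action a; take the argmax.
  Action choose(const std::array<double, 2> &QValues) {
    return QValues[static_cast<size_t>(Action::Inline)] >
                   QValues[static_cast<size_t>(Action::NoInline)]
               ? Action::Inline
               : Action::NoInline;
  }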
The choice of features and reward is essential in model training. The features are chosen for being helpful in making the decision - the input to the current cost model is a good starting point. Ideally, the reward in the inline-for-size problem would be the native size shrinkage after inline/not inline. It is difficult to obtain precisely, so we use an estimate instead. This means that, for training, we also need a model (not necessarily machine-learned) for estimating rewards.

RL - Iterative data collection/model training: reinforcement learning is ideally an iterative model/policy improvement process in which: 1) the trained model is deployed to the field to collect new data; 2) the newly collected data are used to update the model. Thus, we need to do iterative data collection/model training. To facilitate data collection (avoiding complex build dependencies and time spent before/after the pass being trained), we isolate the IR modules captured right before the optimization we are interested in, and iterate on them with opt. We bootstrap the training from the current heuristic, and stop the process once we are happy with the outcome.

IR2Native: we collect IR features (different from the ones used for inlining) for each function at the end of inlining, right before we perform the function simplification passes, and again right after. This means we have two IR 'shapes' of the same function, and we know no further inlinings will be performed, so whatever changes happen are based on that IR. We then extract the native size at the end of compilation. Together, this data forms two records per function that can be used in supervised learning - the features are those extracted from the IR, and the label is the native size. Training IR2Native happens independently from the training of the inliner model.
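Sketching the shape of that training data (types and field names are made up for illustration): two records per function, one captured before the function simplification passes and one after, each pairing IR-derived features with the native size measured at the end of compilation.

  #include <cstdint>
  #include <string>
  #include <vector>

  // One supervised-learning example for IR2Native; purely illustrative layout.
  struct IR2NativeRecord {
    enum class Stage { BeforeSimplification, AfterSimplification };
    std::string FunctionName;
    Stage CapturedAt;                // which of the two IR 'shapes' this is
    std::vector<int64_t> IRFeatures; // e.g. instruction/block counts (illustrative)
    uint64_t NativeSizeBytes;        // label: size measured in the final object
  };

  // The corpus is then a flat list of such records, fed to a regression model
  // whose output is used only as the RL reward estimate during training.
  using IR2NativeTrainingSet = std::vector<IR2NativeRecord>;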
Training support for the current scope

The initial change includes the logging mechanism, an ir2native model trained for x86-64, and the means to rapidly iterate over development ML models. For the components that will be included in subsequent changes, the rest of this section describes the mechanisms we employed.

To build LLVM with the ML policy in 'Dev' mode, we need the tensorflow C API library <https://www.tensorflow.org/install/lang_c>. We then configure the build:

  cmake .. -DLLVM_USE_ML_POLICY=Dev \
    -DLLVM_TF_C_LIB=<path to unarchived package> \
    {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy the tensorflow shared library over, if it's not on LD_LIBRARY_PATH}

To extract IR right before inlining, we hacked on top of the ThinLTO infrastructure, interrupting its pre-link pipeline right before inlining. This means clang produces binary IR files captured at that stage. We then built a large target, obtaining a corpus of ~25K modules. We intend to provide a clean mechanism in a subsequent change.

To obtain features/labels for training this "IR to Native Size" model, we had to make some changes to the AsmPrinter (to get real native sizes) and to llvm-readobj, as well as add an analysis pass for extracting the IR features for this model. We plan on upstreaming these changes subsequently.

Finally, the infrastructure driving the policy training is currently built on proprietary APIs, as it benefits from a distributed computing infrastructure. We are currently evaluating options for open sourcing it. In the meantime, we present the high-level implementation details.

To train a new model, the infrastructure performs two steps: extracting the logs, and using them in a training algorithm.

Log extraction is highly parallelizable: for each IR module in the training corpus, we need to run opt once (or a few times, when we explore improvements). Specifically, each run is this invocation:

  opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to ir2native> \
    -training-log=<path to training log output> -enable-ml-inliner \
    -mandatory-inlinings-first -o <output> <module.o>

We then collect the logs and pass them to the next step (a rough parallel driver sketch is given at the end of this post).

Experimental results

Experimental results are available as follows:

- SPEC2006 binary sizes (-Oz) and 'test run' scores:
  <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
- Size report from some internal benchmarks and binaries, including opt and clang:
  <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
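As an aside on the log-extraction step above: since each module is processed independently, a driver can fan the opt invocations out across processes. Below is a rough, hypothetical C++ sketch of such a driver - the corpus layout, the paths, and the unbounded parallelism are placeholder choices; the opt flags are the ones quoted above.

  #include <cstdlib>
  #include <filesystem>
  #include <future>
  #include <string>
  #include <vector>

  namespace fs = std::filesystem;

  // Run the quoted opt invocation for one module; the output module is
  // discarded here, since only the training log matters for this step.
  int runOne(fs::path Module, fs::path LogDir, std::string IR2NativeModel) {
    const std::string Cmd =
        "opt -passes=scc-oz-module-inliner"
        " -ml-inliner-ir2native-model=" + IR2NativeModel +
        " -training-log=" + (LogDir / Module.filename()).string() + ".log" +
        " -enable-ml-inliner -mandatory-inlinings-first"
        " -o /dev/null " + Module.string();
    return std::system(Cmd.c_str());
  }

  int main() {
    const fs::path Corpus = "corpus";       // pre-inlining IR modules (placeholder path)
    const fs::path Logs = "training-logs";  // one log per module (placeholder path)
    const std::string IR2Native = "ir2native/saved_model"; // placeholder path
    fs::create_directories(Logs);

    // One async job per module; a real driver would bound the number of
    // concurrent opt processes.
    std::vector<std::future<int>> Jobs;
    for (const auto &Entry : fs::directory_iterator(Corpus))
      if (Entry.path().extension() == ".o")
        Jobs.push_back(std::async(std::launch::async, runOne, Entry.path(),
                                  Logs, IR2Native));
    for (auto &J : Jobs)
      J.get(); // exit codes could be checked or retried here
    return 0;
  }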