thr3ads.net - llvm dev - [llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM [Apr 2020]

If this information is useful, please help other people find it:
Share via:

Mircea Trofin via llvm-dev

2020-Apr-08 21:04 UTC

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

TL;DR; We can improve compiler optimizations driven by heuristics by
replacing those heuristics with machine-learned policies (ML models).
Policies are trained offline and ship as part of the compiler. Determinism
is maintained because models are fixed when the compiler is operating in
production. Fine-tuning or regressions may be handled by incorporating the
interesting cases in the ML training set, retraining the compiler, and
redeploying it.

For a first milestone, we chose inlining for size (-Oz) on X86-64. We were
able to train an ML model to produce binaries 1.5-6% smaller than -Oz of
tip-of-tree. The trained model appears to generalize well over a diverse
set of binaries. Compile time is increased marginally (under 5%). The model
also happens to produce slightly better - performing code under SPEC2006 -
total score improvement by 1.75%. As we only wanted to verify there is no
significant regression in SPEC, and given the milestone goals, we haven’t
dug any deeper into the speed results.

We see these results as promising, and as a reasonable point for
contributing our current work as a build-time opt-in to LLVM to benefit the
community, in the hope of fostering collaboration and learning from the
community’s feedback, as we try to better understand the trade-offs such an
approach entails, and as we work on expanding the depth and breadth of
applying these techniques to compiler optimization problems.

https://reviews.llvm.org/D77752
Motivation

Optimization problems, such as inlining or register allocation, are hard:
we don’t know of algorithms that are guaranteed to produce the optimum
solution in all cases in a timely fashion. Instead, we use heuristics:
algorithms that we expect to produce some approximation of the optimum,
with some expected degree of generality, in a practical amount of time.

Heuristics have some common characteristics. Taking inlining as a case
study, it traverses the problem space in some way (bottom-up traversal of
the SCC graph), extracts some properties (let’s call them “features”) of
the program being optimized, and combine them with some weights (“tune”),
to produce a cost (InlineCost), which allows for trade-off analysis. We
validate the effectiveness of a heuristic over some accepted set of
benchmarks. Over time, we react to regressions or pathological cases
observed in the field, by manually analyzing such cases, figuring out an
enhancement to the heuristic, and then re-validating over that set of
benchmarks (maybe augmented by adding the newly found cases).

Because heuristics are code that needs to be maintained, there is pressure
to reduce complexity: adding more features means we need to reason about
the interactions between the old and new features, which scales
combinatorially. Re-tuning because of the new features adds a similar kind
of complexity. Potentially, we miss out on optimization improvements as a
result.

Because tuning is manual, there is pressure to keep the number of
benchmarks that can be studied in depth to a humanly-manageable size, which
potentially affects the generality of a heuristic or heuristic tuning.

The main advantage of manual heuristics is arguably their relatively lower
overhead: no additional dependencies and more transparent to human analysis
and improvement.

Machine learning, in particular reinforcement learning, can address the
tensions found in manual heuristics: once features are extracted from the
program, the way they are combined and tuned can easily be scaled up
through automation, improving effectiveness and generality. A major
drawback, at least at this point in time, of machine learning, is that we
don’t yet have a fully developed systematic approach for improving policy
effectiveness.
High level design

We identify two scenarios for a compiler using ML policies: development and
release.

The release scenario is equivalent to the regular compilation we have today
- the only difference is that it uses a pre-trained model (trained in the
development scenario beforehand) to make decisions instead of the
heuristics. Determinism is guaranteed since the model in the release
scenario is fixed. We imagine teams wishing to fine tune the effectiveness
of the optimization to their scenarios would train a different model.

The decision previously evaluated using a human-crafted heuristic is
optionally replaced by:

a compiler-specific component, extracting features from IR (i.e. a
vector of values)
-

an evaluation of an ML model using those features, to obtain a result.
In ML nomenclature, this is referred to using the model for inference (as
opposed to training it)

For example, when we replaced the decision of whether to inline a callsite,
the ML model produces a boolean (inline/don’t inline) based on a features
vector characterizing the call site and some broader module-wide context.

Training/development is more complicated, and happens offline - akin to
how, today, attempts to improve an optimizing pass also happen offline. A
description of the high level design and the specifics we used for the
current scope are given in Appendix.
Current Scope

The goal of our first milestone was to evaluate end to end an integration
of ML with LLVM, and get a first promising result. To that end, we chose
inlining for size (-Oz) as a stepping stone, as we perceived it to be more
likely to require a simpler evaluation setup than performance-oriented
optimizations might. At this point, we only train whether a call site may
be inlined or not, leaving the SCC traversal order as-is.

We are proposing an initial change demonstrating the inference mechanism
using a pre-trained model, as a build-time opt-in to llvm. The compiler
components needed to perform training are also included in this first
change. Subsequent changes would include more training-related components.

At a high level, the changes we are proposing consist of:

a new module analysis pass, InliningAdvisor. By default, its
implementation does nothing.
2.

minimal hooks into Inliner.cpp.
3.

the implementation of InliningAdvisor, activated when we opt-in ML. This
is available in Analysis/ML, together with:
1.

Rel/Dev specific ML model handing, also under Analysis/ML
2.

a pre-trained model for inlining for size
(Analysis/ML/models/inlining)
3.

a pre-trained model for predicting native size from IR
(Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
4.

Some refactorings in PassBuilder, to allow opting into running mandatory
inlining first - some compilation speedup for the ML case, minimal,
noise-like size effect. Also simplifies testing (these would be introduced
as a preliminary patch).

We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is more
involved.
Inference Opt-In Mechanism

The feature is primarily controlled by the cmake flag
LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The
“Rel”ease case requires specifying the location of the pip tensorflow
package (currently, that’s tf_nightly, and it should soon be available in
tensorflow)

To opt in the ‘Rel’ case:

install tensorflow pip package

pip3 install tf_nightly --user

configure llvm build

cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \

-DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \

{-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - it’s
usually in the path>}

build llvm as usual.
2.

pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to
clang.

Details

The ML model is captured as a TensorFlow ‘saved model’. When building llvm,
we use TensorFlow’s XLA native compiler (saved_model_cli) to compile the
saved model into a native static library and a header file. Insofar as LLVM
is concerned, there are minimal additional runtime requirements, packaged
as source within the pip package: C++ wrappers around the compiled model.
These will also be statically linked in the LLVM target. The compiled code
is otherwise just a complex arithmetical computation, with no special
requirements - it is single threaded and runs natively on the targeted
architecture. Together with the aforementioned runtime dependencies, it
adds ~115KB to the clang binary (0.08% increase)

Runtime-wise, we observed a ~10% increase in the time spent in the inliner,
for a large (33MB) binary IR module; inlining typically consumes ~10-15% of
total compilation time, so the overall compile time overhead of the
approach is arguably negligible. This cost is almost in entirety
attributable to feature extraction.

Memory-wise, the precompiled model has a fixed size buffer for its inputs,
and performs a fixed amount of computations, so the memory overhead
inherent to our approach is independent from the program being optimized.
Using a small example to avoid effects such as memory use differences due
to different inlinings, we observed an 300KB increase in the maximum
resident size.

A table showing effect on -Oz compiled binaries’ size is given in Appendix.
Next directions

Our next milestone has two main high level goals: detailing a systematic
approach to driving policy effectiveness; and exploring in depth the type
of features and training algorithms most appropriate for compiler problems,
or at least problems like inlining. For the latter, we expect embedding
more of the call graph structure to play an important role, as well as,
potentially, delegating the problem space traversal to the ML model.

We plan to include inlining for speed as part of our work on these goals.
AppendixTraining - High Level

We use Reinforcement Learning (RL) to train the Inline-for-size model. At a
high level, it is composed of 3 parts: training data collection, model
training, and iterative data collection/model training. We use TensorFlow
as our ML framework.

Related, we also needed to learn a separate model to evaluate the native
size of a function, given its IR, in order to calculate a more precise
reward for the reinforcement learning algorithm (“IR2Native”). We evaluated
‘just counting IR’ and TargetTransformInfo, but they appeared to provide
too noisy of a signal for the reward, insofar as the RL training algorithm
for the inlining model was concerned. This model is only used during
training.

RL - Training data collection: the training data we need to feed into a
reinforcement learning algorithm are sequences consisting of: state of the
problem (i.e. features); action (inline/not inline), and reward (native
size shrinkage after inline/not inline, using ir2native). To collect the
sequences, we hook the logging infrastructure into LLVM Inliner that is
able to produce logs after the inline optimization pass.

RL - Model training: We use DQN (Deep Q-Network) to train our
inlining-for-size ML policy. On a high level, the DQN algorithm trains a
neural network to predict the value of different actions --- the DQN policy
then chooses to take the action with the highest predicted value. In our
scenario, we have two actions: 1) inline; 2) not inline, so the neural
network predicts the size reduction of these two actions based on features,
and then decides to conduct inlining if the neural network believes doing
inlining leads to higher size reduction. After the training finishes, it
produces a TensorFlow SavedModel that takes features as input and generates
inline decisions (whether to inline or not) as output.

The choice of the features and reward are essential in model training. The
features are chosen with the consideration of being helpful in making the
decision --- the input to the cost model is a good starting point. Ideally,
the reward in the inline-for-size problem is the native size shrinkage
after inline/not inline. It is difficult to obtain precisely, so we use the
estimate as an alternative. This means that, for training, we also need a
model (not necessarily machine learned) for estimating rewards.

RL - Iterative data collection/model training: Reinforcement learning is
ideally an iterative model/policy improvement process that: 1) the trained
model is deployed to the field to collect new data; 2) newly collected data
are used to update the model. Thus, we need to do iterative data
collection/model training. To facilitate data collection (avoid complex
build dependencies and time spent before/after the pass being trained), we
isolate out IR modules captured right before the optimization we are
interested in, and iterate on them with opt. We bootstrap the training from
the current heuristic, and stop the process once we are happy with the
outcome.

IR2Native: We collect IR features (different from the ones used for
inlining) for each function at the end of inlining, right before we perform
function simplification passes, and right after. This means we have two IR
‘shapes’ of the same function, and we know no further inlinings will be
performed, so whatever changes happen are based on that IR. We then extract
the native size at the end of compilation. Together, this data forms two
records per function that can be used in supervised learning - the features
are those extracted from IR, and the label is the native size. Training
IR2Native happens independently from the training of the inliner model.
Training support for the current scope

The initial change includes the logging mechanism, an ir2native model
trained for x86-64, and the means to rapidly iterate over development ML
models. For the components that will be included in subsequent changes, the
rest of this section describes the mechanisms we employed. These components
are detailed further below.

To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow C
API library <https://www.tensorflow.org/install/lang_c>. We then configure
the build:

cmake .. -DLLVM_USE_ML_POLICY=Dev \

-DLLVM_TF_C_LIB=<path to unarchived package> \

{-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy tensorflow shared
library over, if it’s not on LD_LIBRARY_PATH}

To extract IR right before inlining, we hacked on top of the ThinLTO
infrastructure, interrupting its pre-link pipeline right before inlining.
This means clang produces binary IR files captured at that stage. We then
built a large target, obtaining a corpus of ~25K modules. We intend to
provide a clean mechanism in a subsequent change.

To obtain features/labels for training this “IR to Native Size” model, we
had to make some changes to the AsmPrinter (to get real native sizes) and
llvm-readobj, as well as add an analysis pass for extracting the IR
features for this model. We plan on upstreaming these changes subsequently.

Finally, the infrastructure driving the policy training is currently built
on proprietary APIs, as it benefits from a distributed computing
infrastructure. We are currently evaluating options for open sourcing it.
In the meantime, we are presenting the high level implementation details.

To train a new model, the infrastructure performs 2 steps: extracting the
logs, and using them in a training algorithm.

Log extraction is highly parallelizable: for each IR module in the training
corpus, we need to run opt once (or a few times, when we explore
improvements). Specifically, each run is this invocation:

opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to
ir2native> -training-log=<path to training log output>
-enable-ml-inliner
-mandatory-inlinings-first -o <output> <module.o>

Then collect the logs, and pass them to the next step.

Experimental results

Experimental results are available as follows:

SPEC2006

<https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
binary sizes (-Oz) and ‘test run’ scores.
-

Size report

<https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
from some internal benchmarks and binaries, including opt and clang
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200408/ef23d1ed/attachment-0001.html>

Johannes Doerfert via llvm-dev

2020-Apr-08 21:10 UTC

head link

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

Cool!

I skipped to the end and tried to access the gdoc and the spreadsheet
but it did tell me I need permissions. Can you make them accessible or
am I the problem?

Thanks,
   Johannes

On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote:
 > TL;DR; We can improve compiler optimizations driven by heuristics by
 > replacing those heuristics with machine-learned policies (ML models).
 > Policies are trained offline and ship as part of the compiler. 
Determinism
 > is maintained because models are fixed when the compiler is operating in
 > production. Fine-tuning or regressions may be handled by 
incorporating the
 > interesting cases in the ML training set, retraining the compiler, and
 > redeploying it.
 >
 > For a first milestone, we chose inlining for size (-Oz) on X86-64. We 
were
 > able to train an ML model to produce binaries 1.5-6% smaller than -Oz of
 > tip-of-tree. The trained model appears to generalize well over a diverse
 > set of binaries. Compile time is increased marginally (under 5%). The 
model
 > also happens to produce slightly better - performing code under 
SPEC2006 -
 > total score improvement by 1.75%. As we only wanted to verify there is no
 > significant regression in SPEC, and given the milestone goals, we haven’t
 > dug any deeper into the speed results.
 >
 > We see these results as promising, and as a reasonable point for
 > contributing our current work as a build-time opt-in to LLVM to 
benefit the
 > community, in the hope of fostering collaboration and learning from the
 > community’s feedback, as we try to better understand the trade-offs 
such an
 > approach entails, and as we work on expanding the depth and breadth of
 > applying these techniques to compiler optimization problems.
 >
 > https://reviews.llvm.org/D77752
 > Motivation
 >
 > Optimization problems, such as inlining or register allocation, are hard:
 > we don’t know of algorithms that are guaranteed to produce the optimum
 > solution in all cases in a timely fashion. Instead, we use heuristics:
 > algorithms that we expect to produce some approximation of the optimum,
 > with some expected degree of generality, in a practical amount of time.
 >
 > Heuristics have some common characteristics. Taking inlining as a case
 > study, it traverses the problem space in some way (bottom-up traversal of
 > the SCC graph), extracts some properties (let’s call them “features”) of
 > the program being optimized, and combine them with some weights (“tune”),
 > to produce a cost (InlineCost), which allows for trade-off analysis. We
 > validate the effectiveness of a heuristic over some accepted set of
 > benchmarks. Over time, we react to regressions or pathological cases
 > observed in the field, by manually analyzing such cases, figuring out an
 > enhancement to the heuristic, and then re-validating over that set of
 > benchmarks (maybe augmented by adding the newly found cases).
 >
 > Because heuristics are code that needs to be maintained, there is 
pressure
 > to reduce complexity: adding more features means we need to reason about
 > the interactions between the old and new features, which scales
 > combinatorially. Re-tuning because of the new features adds a similar 
kind
 > of complexity. Potentially, we miss out on optimization improvements as a
 > result.
 >
 > Because tuning is manual, there is pressure to keep the number of
 > benchmarks that can be studied in depth to a humanly-manageable size, 
which
 > potentially affects the generality of a heuristic or heuristic tuning.
 >
 > The main advantage of manual heuristics is arguably their relatively 
lower
 > overhead: no additional dependencies and more transparent to human 
analysis
 > and improvement.
 >
 > Machine learning, in particular reinforcement learning, can address the
 > tensions found in manual heuristics: once features are extracted from the
 > program, the way they are combined and tuned can easily be scaled up
 > through automation, improving effectiveness and generality. A major
 > drawback, at least at this point in time, of machine learning, is that we
 > don’t yet have a fully developed systematic approach for improving policy
 > effectiveness.
 > High level design
 >
 > We identify two scenarios for a compiler using ML policies: 
development and
 > release.
 >
 > The release scenario is equivalent to the regular compilation we have 
today
 > - the only difference is that it uses a pre-trained model (trained in the
 > development scenario beforehand) to make decisions instead of the
 > heuristics. Determinism is guaranteed since the model in the release
 > scenario is fixed. We imagine teams wishing to fine tune the 
effectiveness
 > of the optimization to their scenarios would train a different model.
 >
 > The decision previously evaluated using a human-crafted heuristic is
 > optionally replaced by:
 >
 >    -
 >
 >    a compiler-specific component, extracting features from IR (i.e. a
 >    vector of values)
 >    -
 >
 >    an evaluation of an ML model using those features, to obtain a result.
 >    In ML nomenclature, this is referred to using the model for 
inference (as
 >    opposed to training it)
 >
 > For example, when we replaced the decision of whether to inline a 
callsite,
 > the ML model produces a boolean (inline/don’t inline) based on a features
 > vector characterizing the call site and some broader module-wide context.
 >
 > Training/development is more complicated, and happens offline - akin to
 > how, today, attempts to improve an optimizing pass also happen offline. A
 > description of the high level design and the specifics we used for the
 > current scope are given in Appendix.
 > Current Scope
 >
 > The goal of our first milestone was to evaluate end to end an integration
 > of ML with LLVM, and get a first promising result. To that end, we chose
 > inlining for size (-Oz) as a stepping stone, as we perceived it to be 
more
 > likely to require a simpler evaluation setup than performance-oriented
 > optimizations might. At this point, we only train whether a call site may
 > be inlined or not, leaving the SCC traversal order as-is.
 >
 > We are proposing an initial change demonstrating the inference mechanism
 > using a pre-trained model, as a build-time opt-in to llvm. The compiler
 > components needed to perform training are also included in this first
 > change. Subsequent changes would include more training-related 
components.
 >
 > At a high level, the changes we are proposing consist of:
 >
 >    1.
 >
 >    a new module analysis pass, InliningAdvisor. By default, its
 >    implementation does nothing.
 >    2.
 >
 >    minimal hooks into Inliner.cpp.
 >    3.
 >
 >    the implementation of InliningAdvisor, activated when we opt-in 
ML. This
 >    is available in Analysis/ML, together with:
 >    1.
 >
 >       Rel/Dev specific ML model handing, also under Analysis/ML
 >       2.
 >
 >       a pre-trained model for inlining for size
 >       (Analysis/ML/models/inlining)
 >       3.
 >
 >       a pre-trained model for predicting native size from IR
 >       (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
 >       4.
 >
 >    Some refactorings in PassBuilder, to allow opting into running 
mandatory
 >    inlining first - some compilation speedup for the ML case, minimal,
 >    noise-like size effect. Also simplifies testing (these would be 
introduced
 >    as a preliminary patch).
 >
 > We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is more
 > involved.
 > Inference Opt-In Mechanism
 >
 > The feature is primarily controlled by the cmake flag
 > LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The
 > “Rel”ease case requires specifying the location of the pip tensorflow
 > package (currently, that’s tf_nightly, and it should soon be available in
 > tensorflow)
 >
 > To opt in the ‘Rel’ case:
 >
 >    1.
 >
 >    install tensorflow pip package
 >
 > pip3 install tf_nightly --user
 >
 >    1.
 >
 >    configure llvm build
 >
 > cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
 >
 > -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
 >
 > {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - it’s
 > usually in the path>}
 >
 >
 >
 >    1.
 >
 >    build llvm as usual.
 >    2.
 >
 >    pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to
 >    clang.
 >
 > Details
 >
 > The ML model is captured as a TensorFlow ‘saved model’. When building 
llvm,
 > we use TensorFlow’s  XLA native compiler (saved_model_cli) to compile the
 > saved model into a native static library and a header file. Insofar 
as LLVM
 > is concerned, there are minimal additional runtime requirements, packaged
 > as source within the pip package: C++ wrappers around the compiled model.
 > These will also be statically linked in the LLVM target. The compiled 
code
 > is otherwise just a complex arithmetical computation, with no special
 > requirements - it is single threaded and runs natively on the targeted
 > architecture. Together with the aforementioned runtime dependencies, it
 > adds ~115KB to the clang binary (0.08% increase)
 >
 > Runtime-wise, we observed a ~10% increase in the time spent in the 
inliner,
 > for a large (33MB) binary IR module; inlining typically consumes 
~10-15% of
 > total compilation time, so the overall compile time overhead of the
 > approach is arguably negligible. This cost is almost in entirety
 > attributable to feature extraction.
 >
 > Memory-wise, the precompiled model has a fixed size buffer for its 
inputs,
 > and performs a fixed amount of computations, so the memory overhead
 > inherent to our approach is independent from the program being optimized.
 > Using a small example to avoid effects such as memory use differences due
 > to different inlinings, we observed an 300KB increase in the maximum
 > resident size.
 >
 > A table showing effect on -Oz compiled binaries’ size is given in 
Appendix.
 > Next directions
 >
 > Our next milestone has two main high level goals: detailing a systematic
 > approach to driving policy effectiveness; and exploring in depth the type
 > of features and training algorithms most appropriate for compiler 
problems,
 > or at least problems like inlining. For the latter, we expect embedding
 > more of the call graph structure to play an important role, as well as,
 > potentially, delegating the problem space traversal to the ML model.
 >
 > We plan to include inlining for speed as part of our work on these goals.
 > AppendixTraining - High Level
 >
 > We use Reinforcement Learning (RL) to train the Inline-for-size 
model. At a
 > high level, it is composed of 3 parts: training data collection, model
 > training, and iterative data collection/model training. We use TensorFlow
 > as our ML framework.
 >
 > Related, we also needed to learn a separate model to evaluate the native
 > size of a function, given its IR, in order to calculate a more precise
 > reward for the reinforcement learning algorithm (“IR2Native”). We 
evaluated
 > ‘just counting IR’ and TargetTransformInfo, but they appeared to provide
 > too noisy of a signal for the reward, insofar as the RL training 
algorithm
 > for the inlining model was concerned. This model is only used during
 > training.
 >
 > RL - Training data collection: the training data we need to feed into a
 > reinforcement learning algorithm are sequences consisting of: state 
of the
 > problem (i.e. features); action (inline/not inline), and reward (native
 > size shrinkage after inline/not inline, using ir2native). To collect the
 > sequences, we hook the logging infrastructure into LLVM Inliner that is
 > able to produce logs after the inline optimization pass.
 >
 > RL - Model training: We use DQN (Deep Q-Network) to train our
 > inlining-for-size ML policy. On a high level, the DQN algorithm trains a
 > neural network to predict the value of different actions --- the DQN 
policy
 > then chooses to take the action with the highest predicted value. In our
 > scenario, we have two actions: 1) inline; 2) not inline, so the neural
 > network predicts the size reduction of these two actions based on 
features,
 > and then decides to conduct inlining if the neural network believes doing
 > inlining leads to higher size reduction. After the training finishes, it
 > produces a TensorFlow SavedModel that takes features as input and 
generates
 > inline decisions (whether to inline or not) as output.
 >
 > The choice of the features and reward are essential in model 
training. The
 > features are chosen with the consideration of being helpful in making the
 > decision --- the input to the cost model is a good starting point. 
Ideally,
 > the reward in the inline-for-size problem is the native size shrinkage
 > after inline/not inline. It is difficult to obtain precisely, so we 
use the
 > estimate as an alternative. This means that, for training, we also need a
 > model (not necessarily machine learned) for estimating rewards.
 >
 > RL - Iterative data collection/model training: Reinforcement learning is
 > ideally an iterative model/policy improvement process that: 1) the 
trained
 > model is deployed to the field to collect new data; 2) newly 
collected data
 > are used to update the model. Thus, we need to do iterative data
 > collection/model training. To facilitate data collection (avoid complex
 > build dependencies and time spent before/after the pass being 
trained), we
 > isolate out IR modules captured right before the optimization we are
 > interested in, and iterate on them with opt. We bootstrap the 
training from
 > the current heuristic, and stop the process once we are happy with the
 > outcome.
 >
 > IR2Native: We collect IR features (different from the ones used for
 > inlining) for each function at the end of inlining, right before we 
perform
 > function simplification passes, and right after. This means we have 
two IR
 > ‘shapes’ of the same function, and we know no further inlinings will be
 > performed, so whatever changes happen are based on that IR. We then 
extract
 > the native size at the end of compilation. Together, this data forms two
 > records per function that can be used in supervised learning - the 
features
 > are those extracted from IR, and the label is the native size. Training
 > IR2Native happens independently from the training of the inliner model.
 > Training support for the current scope
 >
 > The initial change includes the logging mechanism, an ir2native model
 > trained for x86-64, and the means to rapidly iterate over development ML
 > models. For the components that will be included in subsequent 
changes, the
 > rest of this section describes the mechanisms we employed. These 
components
 > are detailed further below.
 >
 > To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow C
 > API library <https://www.tensorflow.org/install/lang_c>. We then 
configure
 > the build:
 >
 > cmake .. -DLLVM_USE_ML_POLICY=Dev \
 >
 > -DLLVM_TF_C_LIB=<path to unarchived package> \
 >
 > {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy tensorflow shared
 > library over, if it’s not on LD_LIBRARY_PATH}
 >
 >
 > To extract IR right before inlining, we hacked on top of the ThinLTO
 > infrastructure, interrupting its pre-link pipeline right before inlining.
 > This means clang produces binary IR files captured at that stage. We then
 > built a large target, obtaining a corpus of ~25K modules. We intend to
 > provide a clean mechanism in a subsequent change.
 >
 > To obtain features/labels for training this “IR to Native Size” model, we
 > had to make some changes to the AsmPrinter (to get real native sizes) and
 > llvm-readobj, as well as add an analysis pass for extracting the IR
 > features for this model. We plan on upstreaming these changes 
subsequently.
 >
 > Finally, the infrastructure driving the policy training is currently 
built
 > on proprietary APIs, as it benefits from a distributed computing
 > infrastructure. We are currently evaluating options for open sourcing it.
 > In the meantime, we are presenting the high level implementation details.
 >
 > To train a new model, the infrastructure performs 2 steps: extracting the
 > logs, and using them in a training algorithm.
 >
 > Log extraction is highly parallelizable: for each IR module in the 
training
 > corpus, we need to run opt once (or a few times, when we explore
 > improvements). Specifically, each run is this invocation:
 >
 > opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to
 > ir2native> -training-log=<path to training log output>
-enable-ml-inliner
 > -mandatory-inlinings-first -o <output> <module.o>
 >
 > Then collect the logs, and pass them to the next step.
 >
 >
 > Experimental results
 >
 > Experimental results are available as follows:
 >
 >
 >    -
 >
 >    SPEC2006
 > 
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
 >    binary sizes (-Oz) and ‘test run’ scores.
 >    -
 >
 >    Size report
 > 
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
 >    from some internal benchmarks and binaries, including opt and clang
 >
 >
 > _______________________________________________
 > LLVM Developers mailing list
 > llvm-dev at lists.llvm.org
 > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Mircea Trofin via llvm-dev

2020-Apr-08 21:23 UTC

head link

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

It turns out it's me, sorry. Let me see how I can sort this out. In the
meantime, here is the csv:

SPEC2006 data:

binary,base -Oz size,ML -Oz size,ML size shrink by,,perf: base -Oz
scores,perf: ML -Oz scores,ML improvement by
400.perlbench,2054200,2086776,-1.59%,,2.9,2.9,0.00%
401.bzip2,1129976,1095544,3.05%,,6.4,6.2,-3.13%
403.gcc,4078488,4130840,-1.28%,,11.6,11.7,0.86%
429.mcf,1089368,1054552,3.20%,,2.3,2.4,4.35%
433.milc,1161336,1129592,2.73%,,0.5,0.5,0.00%
444.namd,1324824,1290968,2.56%,,4.4,4.3,-2.27%
445.gobmk,5003096,4992472,0.21%,,3.9,4.1,5.13%
447.dealII,1003376,975024,2.83%,,162.1,197.8,22.02%
450.soplex,1359416,1326008,2.46%,,0.1,0.1,0.00%
453.povray,1921432,1952280,-1.61%,,32.9,32.7,-0.61%
456.hmmer,1210744,1184632,2.16%,,1.5,1.5,0.00%
458.sjeng,1185976,1155320,2.58%,,812.9,840.2,3.36%
462.libquantum,1101144,1066712,3.13%,,12.3,12.2,-0.81%
464.h264ref,1557176,1593272,-2.32%,,0.3,0.3,0.00%
470.lbm,1091352,1056664,3.18%,,2.3,2.5,8.70%
471.omnetpp,1045496,1046664,-0.11%,,27.9,28.4,1.79%
473.astar,1106680,1071992,3.13%,,2.4,2.6,8.33%
482.sphinx3,1216600,1182680,2.79%,,16.5,16.3,-1.21%
483.xalancbmk,4666936,4669112,-0.05%,,9.7,9.6,-1.03%
,,,,,,,
SIZE SUM,34307616,34061104,0.72%,TOTAL SCORES,5.7,5.8,1.75%


various benchmarks:

binary,base -Oz size,ML -Oz size,ML shrink over base
llvm:opt,66318624,62124256,6.32%
math_1,6181552,5884144,4.81%
event_management_1,4998728,4802696,3.92%
llvm:lcalsBRaw,1270168,1222168,3.78%
llvm:lcalsBLambda,1270232,1222232,3.78%
llvm:lcalsARaw,1276248,1228248,3.76%
llvm:lcalsALambda,1276440,1228504,3.76%
llvm:lcalsCRaw,1278936,1231000,3.75%
llvm:lcalsCLambda,1279064,1231256,3.74%
llvm:Blur,1236888,1191192,3.69%
image_processing_diffusion,1236120,1190488,3.69%
llvm:Interpolation,1237208,1191576,3.69%
llvm:harris,1237208,1191768,3.67%
llvm:Dither,1238232,1192792,3.67%
event_management_2,4741096,4568584,3.64%
hashing_1,5189360,5004176,3.57%
compression_2,5329392,5140496,3.54%
math_2,6072336,5858992,3.51%
crypto_1,4721576,4556008,3.51%
math_4,27720208,26755568,3.48%
math_3,27732496,26767920,3.48%
infra_1,30673232,29630464,3.40%
hashing_2,5834544,5637168,3.38%
image_processing_entropy,5854960,5657008,3.38%
hashing_3,5831088,5634288,3.38%
memory_management_1,4957960,4790632,3.37%
hasing_4,5816048,5619888,3.37%
data_storage_2,5852976,5655792,3.37%
compression_1,5971984,5771120,3.36%
arena_unittest,6672944,6453584,3.29%
data_storage_3,44113232,42710960,3.18%
data_storage_1,22858992,22132960,3.18%
datastructures_1,6362032,6161264,3.16%
infra_2,56828720,55053496,3.12%
datastructures_2,164558704,160790320,2.29%
llvm:clang,77027416,75672088,1.76%
benchmark_infra,20221424,19944432,1.37%
search_1,169475360,167645632,1.08%
protobuf_3,24071504,23895440,0.73%
protobuf_1,24071504,23895504,0.73%
protobuf_2,24071504,23895504,0.73%
protobuf_4,24071504,23895504,0.73%
protobuf_5,24071504,23895504,0.73%
protobuf_6,23982800,23809424,0.72%

On Wed, Apr 8, 2020 at 2:11 PM Johannes Doerfert <johannesdoerfert at
gmail.com>
wrote:
>
> Cool!
>
> I skipped to the end and tried to access the gdoc and the spreadsheet
> but it did tell me I need permissions. Can you make them accessible or
> am I the problem?
>
> Thanks,
>    Johannes
>
> On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote:
>  > TL;DR; We can improve compiler optimizations driven by heuristics by
>  > replacing those heuristics with machine-learned policies (ML models).
>  > Policies are trained offline and ship as part of the compiler.
> Determinism
>  > is maintained because models are fixed when the compiler is operating
in
>  > production. Fine-tuning or regressions may be handled by
> incorporating the
>  > interesting cases in the ML training set, retraining the compiler,
and
>  > redeploying it.
>  >
>  > For a first milestone, we chose inlining for size (-Oz) on X86-64. We
> were
>  > able to train an ML model to produce binaries 1.5-6% smaller than -Oz
of
>  > tip-of-tree. The trained model appears to generalize well over a
diverse
>  > set of binaries. Compile time is increased marginally (under 5%). The
> model
>  > also happens to produce slightly better - performing code under
> SPEC2006 -
>  > total score improvement by 1.75%. As we only wanted to verify there
is
> no
>  > significant regression in SPEC, and given the milestone goals, we
> haven’t
>  > dug any deeper into the speed results.
>  >
>  > We see these results as promising, and as a reasonable point for
>  > contributing our current work as a build-time opt-in to LLVM to
> benefit the
>  > community, in the hope of fostering collaboration and learning from
the
>  > community’s feedback, as we try to better understand the trade-offs
> such an
>  > approach entails, and as we work on expanding the depth and breadth
of
>  > applying these techniques to compiler optimization problems.
>  >
>  > https://reviews.llvm.org/D77752
>  > Motivation
>  >
>  > Optimization problems, such as inlining or register allocation, are
> hard:
>  > we don’t know of algorithms that are guaranteed to produce the
optimum
>  > solution in all cases in a timely fashion. Instead, we use
heuristics:
>  > algorithms that we expect to produce some approximation of the
optimum,
>  > with some expected degree of generality, in a practical amount of
time.
>  >
>  > Heuristics have some common characteristics. Taking inlining as a
case
>  > study, it traverses the problem space in some way (bottom-up
traversal
> of
>  > the SCC graph), extracts some properties (let’s call them “features”)
of
>  > the program being optimized, and combine them with some weights
> (“tune”),
>  > to produce a cost (InlineCost), which allows for trade-off analysis.
We
>  > validate the effectiveness of a heuristic over some accepted set of
>  > benchmarks. Over time, we react to regressions or pathological cases
>  > observed in the field, by manually analyzing such cases, figuring out
an
>  > enhancement to the heuristic, and then re-validating over that set of
>  > benchmarks (maybe augmented by adding the newly found cases).
>  >
>  > Because heuristics are code that needs to be maintained, there is
> pressure
>  > to reduce complexity: adding more features means we need to reason
about
>  > the interactions between the old and new features, which scales
>  > combinatorially. Re-tuning because of the new features adds a similar
> kind
>  > of complexity. Potentially, we miss out on optimization improvements
as
> a
>  > result.
>  >
>  > Because tuning is manual, there is pressure to keep the number of
>  > benchmarks that can be studied in depth to a humanly-manageable size,
> which
>  > potentially affects the generality of a heuristic or heuristic
tuning.
>  >
>  > The main advantage of manual heuristics is arguably their relatively
> lower
>  > overhead: no additional dependencies and more transparent to human
> analysis
>  > and improvement.
>  >
>  > Machine learning, in particular reinforcement learning, can address
the
>  > tensions found in manual heuristics: once features are extracted from
> the
>  > program, the way they are combined and tuned can easily be scaled up
>  > through automation, improving effectiveness and generality. A major
>  > drawback, at least at this point in time, of machine learning, is
that
> we
>  > don’t yet have a fully developed systematic approach for improving
> policy
>  > effectiveness.
>  > High level design
>  >
>  > We identify two scenarios for a compiler using ML policies:
> development and
>  > release.
>  >
>  > The release scenario is equivalent to the regular compilation we have
> today
>  > - the only difference is that it uses a pre-trained model (trained in
> the
>  > development scenario beforehand) to make decisions instead of the
>  > heuristics. Determinism is guaranteed since the model in the release
>  > scenario is fixed. We imagine teams wishing to fine tune the
> effectiveness
>  > of the optimization to their scenarios would train a different model.
>  >
>  > The decision previously evaluated using a human-crafted heuristic is
>  > optionally replaced by:
>  >
>  >    -
>  >
>  >    a compiler-specific component, extracting features from IR (i.e. a
>  >    vector of values)
>  >    -
>  >
>  >    an evaluation of an ML model using those features, to obtain a
> result.
>  >    In ML nomenclature, this is referred to using the model for
> inference (as
>  >    opposed to training it)
>  >
>  > For example, when we replaced the decision of whether to inline a
> callsite,
>  > the ML model produces a boolean (inline/don’t inline) based on a
> features
>  > vector characterizing the call site and some broader module-wide
> context.
>  >
>  > Training/development is more complicated, and happens offline - akin
to
>  > how, today, attempts to improve an optimizing pass also happen
offline.
> A
>  > description of the high level design and the specifics we used for
the
>  > current scope are given in Appendix.
>  > Current Scope
>  >
>  > The goal of our first milestone was to evaluate end to end an
> integration
>  > of ML with LLVM, and get a first promising result. To that end, we
chose
>  > inlining for size (-Oz) as a stepping stone, as we perceived it to be
> more
>  > likely to require a simpler evaluation setup than
performance-oriented
>  > optimizations might. At this point, we only train whether a call site
> may
>  > be inlined or not, leaving the SCC traversal order as-is.
>  >
>  > We are proposing an initial change demonstrating the inference
mechanism
>  > using a pre-trained model, as a build-time opt-in to llvm. The
compiler
>  > components needed to perform training are also included in this first
>  > change. Subsequent changes would include more training-related
> components.
>  >
>  > At a high level, the changes we are proposing consist of:
>  >
>  >    1.
>  >
>  >    a new module analysis pass, InliningAdvisor. By default, its
>  >    implementation does nothing.
>  >    2.
>  >
>  >    minimal hooks into Inliner.cpp.
>  >    3.
>  >
>  >    the implementation of InliningAdvisor, activated when we opt-in
> ML. This
>  >    is available in Analysis/ML, together with:
>  >    1.
>  >
>  >       Rel/Dev specific ML model handing, also under Analysis/ML
>  >       2.
>  >
>  >       a pre-trained model for inlining for size
>  >       (Analysis/ML/models/inlining)
>  >       3.
>  >
>  >       a pre-trained model for predicting native size from IR
>  >       (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
>  >       4.
>  >
>  >    Some refactorings in PassBuilder, to allow opting into running
> mandatory
>  >    inlining first - some compilation speedup for the ML case,
minimal,
>  >    noise-like size effect. Also simplifies testing (these would be
> introduced
>  >    as a preliminary patch).
>  >
>  > We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is
> more
>  > involved.
>  > Inference Opt-In Mechanism
>  >
>  > The feature is primarily controlled by the cmake flag
>  > LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies.
The
>  > “Rel”ease case requires specifying the location of the pip tensorflow
>  > package (currently, that’s tf_nightly, and it should soon be
available
> in
>  > tensorflow)
>  >
>  > To opt in the ‘Rel’ case:
>  >
>  >    1.
>  >
>  >    install tensorflow pip package
>  >
>  > pip3 install tf_nightly --user
>  >
>  >    1.
>  >
>  >    configure llvm build
>  >
>  > cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
>  >
>  > -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow
\
>  >
>  > {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed -
it’s
>  > usually in the path>}
>  >
>  >
>  >
>  >    1.
>  >
>  >    build llvm as usual.
>  >    2.
>  >
>  >    pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first
to
>  >    clang.
>  >
>  > Details
>  >
>  > The ML model is captured as a TensorFlow ‘saved model’. When building
> llvm,
>  > we use TensorFlow’s  XLA native compiler (saved_model_cli) to compile
> the
>  > saved model into a native static library and a header file. Insofar
> as LLVM
>  > is concerned, there are minimal additional runtime requirements,
> packaged
>  > as source within the pip package: C++ wrappers around the compiled
> model.
>  > These will also be statically linked in the LLVM target. The compiled
> code
>  > is otherwise just a complex arithmetical computation, with no special
>  > requirements - it is single threaded and runs natively on the
targeted
>  > architecture. Together with the aforementioned runtime dependencies,
it
>  > adds ~115KB to the clang binary (0.08% increase)
>  >
>  > Runtime-wise, we observed a ~10% increase in the time spent in the
> inliner,
>  > for a large (33MB) binary IR module; inlining typically consumes
> ~10-15% of
>  > total compilation time, so the overall compile time overhead of the
>  > approach is arguably negligible. This cost is almost in entirety
>  > attributable to feature extraction.
>  >
>  > Memory-wise, the precompiled model has a fixed size buffer for its
> inputs,
>  > and performs a fixed amount of computations, so the memory overhead
>  > inherent to our approach is independent from the program being
> optimized.
>  > Using a small example to avoid effects such as memory use differences
> due
>  > to different inlinings, we observed an 300KB increase in the maximum
>  > resident size.
>  >
>  > A table showing effect on -Oz compiled binaries’ size is given in
> Appendix.
>  > Next directions
>  >
>  > Our next milestone has two main high level goals: detailing a
systematic
>  > approach to driving policy effectiveness; and exploring in depth the
> type
>  > of features and training algorithms most appropriate for compiler
> problems,
>  > or at least problems like inlining. For the latter, we expect
embedding
>  > more of the call graph structure to play an important role, as well
as,
>  > potentially, delegating the problem space traversal to the ML model.
>  >
>  > We plan to include inlining for speed as part of our work on these
> goals.
>  > AppendixTraining - High Level
>  >
>  > We use Reinforcement Learning (RL) to train the Inline-for-size
> model. At a
>  > high level, it is composed of 3 parts: training data collection,
model
>  > training, and iterative data collection/model training. We use
> TensorFlow
>  > as our ML framework.
>  >
>  > Related, we also needed to learn a separate model to evaluate the
native
>  > size of a function, given its IR, in order to calculate a more
precise
>  > reward for the reinforcement learning algorithm (“IR2Native”). We
> evaluated
>  > ‘just counting IR’ and TargetTransformInfo, but they appeared to
provide
>  > too noisy of a signal for the reward, insofar as the RL training
> algorithm
>  > for the inlining model was concerned. This model is only used during
>  > training.
>  >
>  > RL - Training data collection: the training data we need to feed into
a
>  > reinforcement learning algorithm are sequences consisting of: state
> of the
>  > problem (i.e. features); action (inline/not inline), and reward
(native
>  > size shrinkage after inline/not inline, using ir2native). To collect
the
>  > sequences, we hook the logging infrastructure into LLVM Inliner that
is
>  > able to produce logs after the inline optimization pass.
>  >
>  > RL - Model training: We use DQN (Deep Q-Network) to train our
>  > inlining-for-size ML policy. On a high level, the DQN algorithm
trains a
>  > neural network to predict the value of different actions --- the DQN
> policy
>  > then chooses to take the action with the highest predicted value. In
our
>  > scenario, we have two actions: 1) inline; 2) not inline, so the
neural
>  > network predicts the size reduction of these two actions based on
> features,
>  > and then decides to conduct inlining if the neural network believes
> doing
>  > inlining leads to higher size reduction. After the training finishes,
it
>  > produces a TensorFlow SavedModel that takes features as input and
> generates
>  > inline decisions (whether to inline or not) as output.
>  >
>  > The choice of the features and reward are essential in model
> training. The
>  > features are chosen with the consideration of being helpful in making
> the
>  > decision --- the input to the cost model is a good starting point.
> Ideally,
>  > the reward in the inline-for-size problem is the native size
shrinkage
>  > after inline/not inline. It is difficult to obtain precisely, so we
> use the
>  > estimate as an alternative. This means that, for training, we also
need
> a
>  > model (not necessarily machine learned) for estimating rewards.
>  >
>  > RL - Iterative data collection/model training: Reinforcement learning
is
>  > ideally an iterative model/policy improvement process that: 1) the
> trained
>  > model is deployed to the field to collect new data; 2) newly
> collected data
>  > are used to update the model. Thus, we need to do iterative data
>  > collection/model training. To facilitate data collection (avoid
complex
>  > build dependencies and time spent before/after the pass being
> trained), we
>  > isolate out IR modules captured right before the optimization we are
>  > interested in, and iterate on them with opt. We bootstrap the
> training from
>  > the current heuristic, and stop the process once we are happy with
the
>  > outcome.
>  >
>  > IR2Native: We collect IR features (different from the ones used for
>  > inlining) for each function at the end of inlining, right before we
> perform
>  > function simplification passes, and right after. This means we have
> two IR
>  > ‘shapes’ of the same function, and we know no further inlinings will
be
>  > performed, so whatever changes happen are based on that IR. We then
> extract
>  > the native size at the end of compilation. Together, this data forms
two
>  > records per function that can be used in supervised learning - the
> features
>  > are those extracted from IR, and the label is the native size.
Training
>  > IR2Native happens independently from the training of the inliner
model.
>  > Training support for the current scope
>  >
>  > The initial change includes the logging mechanism, an ir2native model
>  > trained for x86-64, and the means to rapidly iterate over development
ML
>  > models. For the components that will be included in subsequent
> changes, the
>  > rest of this section describes the mechanisms we employed. These
> components
>  > are detailed further below.
>  >
>  > To build LLVM with the ML policy in ‘Dev’ mode, we need the
tensorflow C
>  > API library <https://www.tensorflow.org/install/lang_c>. We
then
> configure
>  > the build:
>  >
>  > cmake .. -DLLVM_USE_ML_POLICY=Dev \
>  >
>  > -DLLVM_TF_C_LIB=<path to unarchived package> \
>  >
>  > {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy tensorflow shared
>  > library over, if it’s not on LD_LIBRARY_PATH}
>  >
>  >
>  > To extract IR right before inlining, we hacked on top of the ThinLTO
>  > infrastructure, interrupting its pre-link pipeline right before
> inlining.
>  > This means clang produces binary IR files captured at that stage. We
> then
>  > built a large target, obtaining a corpus of ~25K modules. We intend
to
>  > provide a clean mechanism in a subsequent change.
>  >
>  > To obtain features/labels for training this “IR to Native Size”
model,
> we
>  > had to make some changes to the AsmPrinter (to get real native sizes)
> and
>  > llvm-readobj, as well as add an analysis pass for extracting the IR
>  > features for this model. We plan on upstreaming these changes
> subsequently.
>  >
>  > Finally, the infrastructure driving the policy training is currently
> built
>  > on proprietary APIs, as it benefits from a distributed computing
>  > infrastructure. We are currently evaluating options for open sourcing
> it.
>  > In the meantime, we are presenting the high level implementation
> details.
>  >
>  > To train a new model, the infrastructure performs 2 steps: extracting
> the
>  > logs, and using them in a training algorithm.
>  >
>  > Log extraction is highly parallelizable: for each IR module in the
> training
>  > corpus, we need to run opt once (or a few times, when we explore
>  > improvements). Specifically, each run is this invocation:
>  >
>  > opt -passes=scc-oz-module-inliner
-ml-inliner-ir2native-model=<path to
>  > ir2native> -training-log=<path to training log output>
> -enable-ml-inliner
>  > -mandatory-inlinings-first -o <output> <module.o>
>  >
>  > Then collect the logs, and pass them to the next step.
>  >
>  >
>  > Experimental results
>  >
>  > Experimental results are available as follows:
>  >
>  >
>  >    -
>  >
>  >    SPEC2006
>  >
> <
>
https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true
> >
>  >    binary sizes (-Oz) and ‘test run’ scores.
>  >    -
>  >
>  >    Size report
>  >
> <
>
https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true
> >
>  >    from some internal benchmarks and binaries, including opt and
clang
>  >
>  >
>  > _______________________________________________
>  > LLVM Developers mailing list
>  > llvm-dev at lists.llvm.org
>  > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200408/bd2f145e/attachment-0001.html>

Adrian Prantl via llvm-dev

2020-Apr-09 17:47 UTC

head link

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

> On Apr 8, 2020, at 2:04 PM, Mircea Trofin via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> TL;DR; We can improve compiler optimizations driven by heuristics by
replacing those heuristics with machine-learned policies (ML models). Policies
are trained offline and ship as part of the compiler. Determinism is maintained
because models are fixed when the compiler is operating in production.
Fine-tuning or regressions may be handled by incorporating the interesting cases
in the ML training set, retraining the compiler, and redeploying it.
> For a first milestone, we chose inlining for size (-Oz) on X86-64. We were
able to train an ML model to produce binaries 1.5-6% smaller than -Oz of
tip-of-tree. The trained model appears to generalize well over a diverse set of
binaries. Compile time is increased marginally (under 5%). The model also
happens to produce slightly better - performing code under SPEC2006 - total
score improvement by 1.75%. As we only wanted to verify there is no significant
regression in SPEC, and given the milestone goals, we haven’t dug any deeper
into the speed results.
> 
How generally applicable are the results? I find it easy to believe that if, for
example, you train a model on SPEC2006, you will get fantastic results when
compiling SPEC2006, but does that translate well to other inputs? Or is the idea
to train the compiler on the exact input program to be compiled, (perhaps in a
CI of sorts), so you end up with a specialized compiler for each input program?

-- adrian

Mircea Trofin via llvm-dev

2020-Apr-09 18:01 UTC

head link

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

On Thu, Apr 9, 2020 at 10:47 AM Adrian Prantl <aprantl at apple.com>
wrote:
>
>
> > On Apr 8, 2020, at 2:04 PM, Mircea Trofin via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >
> > TL;DR; We can improve compiler optimizations driven by heuristics by
> replacing those heuristics with machine-learned policies (ML models).
> Policies are trained offline and ship as part of the compiler. Determinism
> is maintained because models are fixed when the compiler is operating in
> production. Fine-tuning or regressions may be handled by incorporating the
> interesting cases in the ML training set, retraining the compiler, and
> redeploying it.
> > For a first milestone, we chose inlining for size (-Oz) on X86-64. We
> were able to train an ML model to produce binaries 1.5-6% smaller than -Oz
> of tip-of-tree. The trained model appears to generalize well over a diverse
> set of binaries. Compile time is increased marginally (under 5%). The model
> also happens to produce slightly better - performing code under SPEC2006 -
> total score improvement by 1.75%. As we only wanted to verify there is no
> significant regression in SPEC, and given the milestone goals, we haven’t
> dug any deeper into the speed results.
> >
>
> How generally applicable are the results? I find it easy to believe that
> if, for example, you train a model on SPEC2006, you will get fantastic
> results when compiling SPEC2006, but does that translate well to other
> inputs?

The current model was trained on an internal corpus of ~25K modules,
obtained from an internal "search" binary. Only ~20 or so are shared
with
llvm, for example. In particular, no SPEC modules were used for training.
So the results appear to generalize reasonably well.

Or is the idea to train the compiler on the exact input program to
be> compiled, (perhaps in a CI of sorts), so you end up with a specialized
> compiler for each input program?
>
That's the anti-goal :) The hope is to have reference policies, trained
from a representative corpus, which are 'good enough' for general use
(just
like manual heuristics today), while having a systematic methodology for
interested parties to go the extra step and fine-tune (retrain) to their
specifics.

>
> -- adrian-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200409/d0488e91/attachment.html>

Owen Anderson via llvm-dev

2020-Apr-12 07:45 UTC

head link

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

Hi Mircea,

A few questions:

- Why is a code size estimator needed? Can't you roll out the compilation
process to get an accurate final size.
- Given that you have a trained code size estimator, how effective would MCTS be
compared to the Q-network?
- Can you provide more information on the feature engineering needed to get this
to work well?
- Can you comment on the time / data set requirements to retrain this model from
scratch?

--Owen
> On Apr 8, 2020, at 2:04 PM, Mircea Trofin via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> TL;DR; We can improve compiler optimizations driven by heuristics by
replacing those heuristics with machine-learned policies (ML models). Policies
are trained offline and ship as part of the compiler. Determinism is maintained
because models are fixed when the compiler is operating in production.
Fine-tuning or regressions may be handled by incorporating the interesting cases
in the ML training set, retraining the compiler, and redeploying it.
> For a first milestone, we chose inlining for size (-Oz) on X86-64. We were
able to train an ML model to produce binaries 1.5-6% smaller than -Oz of
tip-of-tree. The trained model appears to generalize well over a diverse set of
binaries. Compile time is increased marginally (under 5%). The model also
happens to produce slightly better - performing code under SPEC2006 - total
score improvement by 1.75%. As we only wanted to verify there is no significant
regression in SPEC, and given the milestone goals, we haven’t dug any deeper
into the speed results.
> 
> We see these results as promising, and as a reasonable point for
contributing our current work as a build-time opt-in to LLVM to benefit the
community, in the hope of fostering collaboration and learning from the
community’s feedback, as we try to better understand the trade-offs such an
approach entails, and as we work on expanding the depth and breadth of applying
these techniques to compiler optimization problems.
> 
> https://reviews.llvm.org/D77752 <https://reviews.llvm.org/D77752>
> Motivation
> Optimization problems, such as inlining or register allocation, are hard:
we don’t know of algorithms that are guaranteed to produce the optimum solution
in all cases in a timely fashion. Instead, we use heuristics: algorithms that we
expect to produce some approximation of the optimum, with some expected degree
of generality, in a practical amount of time.
> 
> Heuristics have some common characteristics. Taking inlining as a case
study, it traverses the problem space in some way (bottom-up traversal of the
SCC graph), extracts some properties (let’s call them “features”) of the program
being optimized, and combine them with some weights (“tune”), to produce a cost
(InlineCost), which allows for trade-off analysis. We validate the effectiveness
of a heuristic over some accepted set of benchmarks. Over time, we react to
regressions or pathological cases observed in the field, by manually analyzing
such cases, figuring out an enhancement to the heuristic, and then re-validating
over that set of benchmarks (maybe augmented by adding the newly found cases).
> 
> Because heuristics are code that needs to be maintained, there is pressure
to reduce complexity: adding more features means we need to reason about the
interactions between the old and new features, which scales combinatorially.
Re-tuning because of the new features adds a similar kind of complexity.
Potentially, we miss out on optimization improvements as a result.
> 
> Because tuning is manual, there is pressure to keep the number of
benchmarks that can be studied in depth to a humanly-manageable size, which
potentially affects the generality of a heuristic or heuristic tuning.
> 
> The main advantage of manual heuristics is arguably their relatively lower
overhead: no additional dependencies and more transparent to human analysis and
improvement.
> 
> Machine learning, in particular reinforcement learning, can address the
tensions found in manual heuristics: once features are extracted from the
program, the way they are combined and tuned can easily be scaled up through
automation, improving effectiveness and generality. A major drawback, at least
at this point in time, of machine learning, is that we don’t yet have a fully
developed systematic approach for improving policy effectiveness.
> 
> High level design
> We identify two scenarios for a compiler using ML policies: development and
release.
> 
> The release scenario is equivalent to the regular compilation we have today
- the only difference is that it uses a pre-trained model (trained in the
development scenario beforehand) to make decisions instead of the heuristics.
Determinism is guaranteed since the model in the release scenario is fixed. We
imagine teams wishing to fine tune the effectiveness of the optimization to
their scenarios would train a different model.
> 
> The decision previously evaluated using a human-crafted heuristic is
optionally replaced by:
> 
> a compiler-specific component, extracting features from IR (i.e. a vector
of values)
> an evaluation of an ML model using those features, to obtain a result. In
ML nomenclature, this is referred to using the model for inference (as opposed
to training it)
> 
> For example, when we replaced the decision of whether to inline a callsite,
the ML model produces a boolean (inline/don’t inline) based on a features vector
characterizing the call site and some broader module-wide context.
> 
> Training/development is more complicated, and happens offline - akin to
how, today, attempts to improve an optimizing pass also happen offline. A
description of the high level design and the specifics we used for the current
scope are given in Appendix.
> 
> Current Scope
> The goal of our first milestone was to evaluate end to end an integration
of ML with LLVM, and get a first promising result. To that end, we chose
inlining for size (-Oz) as a stepping stone, as we perceived it to be more
likely to require a simpler evaluation setup than performance-oriented
optimizations might. At this point, we only train whether a call site may be
inlined or not, leaving the SCC traversal order as-is.
> 
> We are proposing an initial change demonstrating the inference mechanism
using a pre-trained model, as a build-time opt-in to llvm. The compiler
components needed to perform training are also included in this first change.
Subsequent changes would include more training-related components.
> 
> At a high level, the changes we are proposing consist of:
> 
> a new module analysis pass, InliningAdvisor. By default, its implementation
does nothing.
> minimal hooks into Inliner.cpp.
> the implementation of InliningAdvisor, activated when we opt-in ML. This is
available in Analysis/ML, together with:
> Rel/Dev specific ML model handing, also under Analysis/ML
> a pre-trained model for inlining for size (Analysis/ML/models/inlining)
> a pre-trained model for predicting native size from IR
(Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
> Some refactorings in PassBuilder, to allow opting into running mandatory
inlining first - some compilation speedup for the ML case, minimal, noise-like
size effect. Also simplifies testing (these would be introduced as a preliminary
patch).
> 
> We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is more
involved.
> 
> Inference Opt-In Mechanism
> The feature is primarily controlled by the cmake flag
LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The “Rel”ease
case requires specifying the location of the pip tensorflow package (currently,
that’s tf_nightly, and it should soon be available in tensorflow)
> 
> To opt in the ‘Rel’ case:
> 
> install tensorflow pip package
> 
> pip3 install tf_nightly --user
> 
> configure llvm build
> 
>      cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
>   -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
>  {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - it’s
usually in the path>}
> 
> build llvm as usual.
> pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to clang.
> 
> Details
> The ML model is captured as a TensorFlow ‘saved model’. When building llvm,
we use TensorFlow’s  XLA native compiler (saved_model_cli) to compile the saved
model into a native static library and a header file. Insofar as LLVM is
concerned, there are minimal additional runtime requirements, packaged as source
within the pip package: C++ wrappers around the compiled model. These will also
be statically linked in the LLVM target. The compiled code is otherwise just a
complex arithmetical computation, with no special requirements - it is single
threaded and runs natively on the targeted architecture. Together with the
aforementioned runtime dependencies, it adds ~115KB to the clang binary (0.08%
increase)
> 
> Runtime-wise, we observed a ~10% increase in the time spent in the inliner,
for a large (33MB) binary IR module; inlining typically consumes ~10-15% of
total compilation time, so the overall compile time overhead of the approach is
arguably negligible. This cost is almost in entirety attributable to feature
extraction.
> 
> Memory-wise, the precompiled model has a fixed size buffer for its inputs,
and performs a fixed amount of computations, so the memory overhead inherent to
our approach is independent from the program being optimized. Using a small
example to avoid effects such as memory use differences due to different
inlinings, we observed an 300KB increase in the maximum resident size.
> 
> A table showing effect on -Oz compiled binaries’ size is given in Appendix.
> 
> Next directions
> Our next milestone has two main high level goals: detailing a systematic
approach to driving policy effectiveness; and exploring in depth the type of
features and training algorithms most appropriate for compiler problems, or at
least problems like inlining. For the latter, we expect embedding more of the
call graph structure to play an important role, as well as, potentially,
delegating the problem space traversal to the ML model.
> 
> We plan to include inlining for speed as part of our work on these goals.
> 
> Appendix
> Training - High Level
> We use Reinforcement Learning (RL) to train the Inline-for-size model. At a
high level, it is composed of 3 parts: training data collection, model training,
and iterative data collection/model training. We use TensorFlow as our ML
framework.
> 
> Related, we also needed to learn a separate model to evaluate the native
size of a function, given its IR, in order to calculate a more precise reward
for the reinforcement learning algorithm (“IR2Native”). We evaluated ‘just
counting IR’ and TargetTransformInfo, but they appeared to provide too noisy of
a signal for the reward, insofar as the RL training algorithm for the inlining
model was concerned. This model is only used during training.
> 
> RL - Training data collection: the training data we need to feed into a
reinforcement learning algorithm are sequences consisting of: state of the
problem (i.e. features); action (inline/not inline), and reward (native size
shrinkage after inline/not inline, using ir2native). To collect the sequences,
we hook the logging infrastructure into LLVM Inliner that is able to produce
logs after the inline optimization pass.
> 
> RL - Model training: We use DQN (Deep Q-Network) to train our
inlining-for-size ML policy. On a high level, the DQN algorithm trains a neural
network to predict the value of different actions --- the DQN policy then
chooses to take the action with the highest predicted value. In our scenario, we
have two actions: 1) inline; 2) not inline, so the neural network predicts the
size reduction of these two actions based on features, and then decides to
conduct inlining if the neural network believes doing inlining leads to higher
size reduction. After the training finishes, it produces a TensorFlow SavedModel
that takes features as input and generates inline decisions (whether to inline
or not) as output.
> 
> The choice of the features and reward are essential in model training. The
features are chosen with the consideration of being helpful in making the
decision --- the input to the cost model is a good starting point. Ideally, the
reward in the inline-for-size problem is the native size shrinkage after
inline/not inline. It is difficult to obtain precisely, so we use the estimate
as an alternative. This means that, for training, we also need a model (not
necessarily machine learned) for estimating rewards.
> 
> RL - Iterative data collection/model training: Reinforcement learning is
ideally an iterative model/policy improvement process that: 1) the trained model
is deployed to the field to collect new data; 2) newly collected data are used
to update the model. Thus, we need to do iterative data collection/model
training. To facilitate data collection (avoid complex build dependencies and
time spent before/after the pass being trained), we isolate out IR modules
captured right before the optimization we are interested in, and iterate on them
with opt. We bootstrap the training from the current heuristic, and stop the
process once we are happy with the outcome.
> 
> IR2Native: We collect IR features (different from the ones used for
inlining) for each function at the end of inlining, right before we perform
function simplification passes, and right after. This means we have two IR
‘shapes’ of the same function, and we know no further inlinings will be
performed, so whatever changes happen are based on that IR. We then extract the
native size at the end of compilation. Together, this data forms two records per
function that can be used in supervised learning - the features are those
extracted from IR, and the label is the native size. Training IR2Native happens
independently from the training of the inliner model.
> 
> Training support for the current scope
> The initial change includes the logging mechanism, an ir2native model
trained for x86-64, and the means to rapidly iterate over development ML models.
For the components that will be included in subsequent changes, the rest of this
section describes the mechanisms we employed. These components are detailed
further below.
> 
> To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow C
API library <https://www.tensorflow.org/install/lang_c>. We then configure
the build:
> 
> cmake .. -DLLVM_USE_ML_POLICY=Dev \
> -DLLVM_TF_C_LIB=<path to unarchived package> \
> {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy tensorflow shared
library over, if it’s not on LD_LIBRARY_PATH}
> 
> To extract IR right before inlining, we hacked on top of the ThinLTO
infrastructure, interrupting its pre-link pipeline right before inlining. This
means clang produces binary IR files captured at that stage. We then built a
large target, obtaining a corpus of ~25K modules. We intend to provide a clean
mechanism in a subsequent change.
> 
> To obtain features/labels for training this “IR to Native Size” model, we
had to make some changes to the AsmPrinter (to get real native sizes) and
llvm-readobj, as well as add an analysis pass for extracting the IR features for
this model. We plan on upstreaming these changes subsequently.
> 
> Finally, the infrastructure driving the policy training is currently built
on proprietary APIs, as it benefits from a distributed computing infrastructure.
We are currently evaluating options for open sourcing it. In the meantime, we
are presenting the high level implementation details.
> 
> To train a new model, the infrastructure performs 2 steps: extracting the
logs, and using them in a training algorithm.
> 
> Log extraction is highly parallelizable: for each IR module in the training
corpus, we need to run opt once (or a few times, when we explore improvements).
Specifically, each run is this invocation:
> 
> opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to
ir2native> -training-log=<path to training log output>
-enable-ml-inliner -mandatory-inlinings-first -o <output> <module.o>
> 
> Then collect the logs, and pass them to the next step.
> 
> 
> Experimental results
> Experimental results are available as follows:
> 
> 
> SPEC2006
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
binary sizes (-Oz) and ‘test run’ scores.
> Size report
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
from some internal benchmarks and binaries, including opt and clang
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200412/5dc1afab/attachment-0001.html>

Mircea Trofin via llvm-dev

2020-Apr-12 16:10 UTC

head link

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

On Sun, Apr 12, 2020 at 12:45 AM Owen Anderson <resistor at mac.com>
wrote:
> Hi Mircea,
>
> A few questions:
>
> - Why is a code size estimator needed? Can't you roll out the
compilation
> process to get an accurate final size.
>
The training algorithm (DQN) is such that it wants a partial reward after
each action. Yundi can provide a deeper answer here. We are considering
alternatives where we could do away with it, though.

> - Given that you have a trained code size estimator, how effective would
> MCTS be compared to the Q-network?
>
I assume you mean doing MCTS 'live', at compilation time? For release
mode,
we'd be concerned with timeliness and determinism. We are considering doing
that for training, though, since that happens offline, and we don't have
those concerns either.

> - Can you provide more information on the feature engineering needed to
> get this to work well?
>
This initial set was chosen mostly for expedience, with the goal of getting
an end to end system working. We also chose to be conservative with our
features - we wanted to explore the limits of hand-picked features, before
going into embedding IR or the SCC graph (for example)

First we wanted to check if some hand-crafted feature set can be used to
train a policy for behavioral cloning - i.e. can we teach a model to
mimic what the manual heuristic does.

The initial set was: total number of functions and call edges;
caller/callee block counts, ir size, users, and conditionally-executed
blocks; number of constant parameters; and the cost estimate of the call
site (i.e. today's cost model, without using thresholds, which gives an
analytical signal of the "inlinability" of the site).

We learned these features weren't good enough - I forget the training
adherence (== the probability the learned model makes the same decision as
the manual policy), Yundi may remember, in any case, it was low.

We added as a result the 'height' feature - the position in the DAG of a
call site, which we chose to be the longest path from the SCC containing
that call site, to a leaf SCC. This proved critical - we got to over 99%
training adherence, and also a modest (~0.4) size reduction.

We also determined capturing the IR sizes of the caller/callee introduced
noise in the further training, so we eliminated them.

> - Can you comment on the time / data set requirements to retrain this
> model from scratch?
>
We use a set of about 25K modules, peeled off an internal app. The
exploration time takes ~15 minutes on what is roughly equivalent to a few
hundred hardware threads (a distributed environment). Training happens
locally, and one iteration is ~30 min (36 core, 72 thread machine). Note
that 'from scratch' is really 'starting from what can be learned
from the
manual heuristic'. So far, we got no additional benefit from more than one
iteration. We believe embedding global information more thoroughly, for
example, will very likely change that; improvements in the ir2native model,
which seems to be more noisy after the first iteration; and/or using
training algorithm alternatives where we can do away with the ir2native
model altogether, like you suggested.

>
> --Owen
>
> On Apr 8, 2020, at 2:04 PM, Mircea Trofin via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> TL;DR; We can improve compiler optimizations driven by heuristics by
> replacing those heuristics with machine-learned policies (ML models).
> Policies are trained offline and ship as part of the compiler. Determinism
> is maintained because models are fixed when the compiler is operating in
> production. Fine-tuning or regressions may be handled by incorporating the
> interesting cases in the ML training set, retraining the compiler, and
> redeploying it.
>
> For a first milestone, we chose inlining for size (-Oz) on X86-64. We were
> able to train an ML model to produce binaries 1.5-6% smaller than -Oz of
> tip-of-tree. The trained model appears to generalize well over a diverse
> set of binaries. Compile time is increased marginally (under 5%). The model
> also happens to produce slightly better - performing code under SPEC2006 -
> total score improvement by 1.75%. As we only wanted to verify there is no
> significant regression in SPEC, and given the milestone goals, we haven’t
> dug any deeper into the speed results.
>
> We see these results as promising, and as a reasonable point for
> contributing our current work as a build-time opt-in to LLVM to benefit the
> community, in the hope of fostering collaboration and learning from the
> community’s feedback, as we try to better understand the trade-offs such an
> approach entails, and as we work on expanding the depth and breadth of
> applying these techniques to compiler optimization problems.
>
> https://reviews.llvm.org/D77752
> Motivation
>
> Optimization problems, such as inlining or register allocation, are hard:
> we don’t know of algorithms that are guaranteed to produce the optimum
> solution in all cases in a timely fashion. Instead, we use heuristics:
> algorithms that we expect to produce some approximation of the optimum,
> with some expected degree of generality, in a practical amount of time.
>
> Heuristics have some common characteristics. Taking inlining as a case
> study, it traverses the problem space in some way (bottom-up traversal of
> the SCC graph), extracts some properties (let’s call them “features”) of
> the program being optimized, and combine them with some weights (“tune”),
> to produce a cost (InlineCost), which allows for trade-off analysis. We
> validate the effectiveness of a heuristic over some accepted set of
> benchmarks. Over time, we react to regressions or pathological cases
> observed in the field, by manually analyzing such cases, figuring out an
> enhancement to the heuristic, and then re-validating over that set of
> benchmarks (maybe augmented by adding the newly found cases).
>
> Because heuristics are code that needs to be maintained, there is pressure
> to reduce complexity: adding more features means we need to reason about
> the interactions between the old and new features, which scales
> combinatorially. Re-tuning because of the new features adds a similar kind
> of complexity. Potentially, we miss out on optimization improvements as a
> result.
>
> Because tuning is manual, there is pressure to keep the number of
> benchmarks that can be studied in depth to a humanly-manageable size, which
> potentially affects the generality of a heuristic or heuristic tuning.
>
> The main advantage of manual heuristics is arguably their relatively lower
> overhead: no additional dependencies and more transparent to human analysis
> and improvement.
>
> Machine learning, in particular reinforcement learning, can address the
> tensions found in manual heuristics: once features are extracted from the
> program, the way they are combined and tuned can easily be scaled up
> through automation, improving effectiveness and generality. A major
> drawback, at least at this point in time, of machine learning, is that we
> don’t yet have a fully developed systematic approach for improving policy
> effectiveness.
> High level design
>
> We identify two scenarios for a compiler using ML policies: development
> and release.
>
> The release scenario is equivalent to the regular compilation we have
> today - the only difference is that it uses a pre-trained model (trained in
> the development scenario beforehand) to make decisions instead of the
> heuristics. Determinism is guaranteed since the model in the release
> scenario is fixed. We imagine teams wishing to fine tune the effectiveness
> of the optimization to their scenarios would train a different model.
>
> The decision previously evaluated using a human-crafted heuristic is
> optionally replaced by:
>
>    - a compiler-specific component, extracting features from IR (i.e. a
>    vector of values)
>    -
>
>    an evaluation of an ML model using those features, to obtain a result.
>    In ML nomenclature, this is referred to using the model for inference
(as
>    opposed to training it)
>
> For example, when we replaced the decision of whether to inline a
> callsite, the ML model produces a boolean (inline/don’t inline) based on a
> features vector characterizing the call site and some broader module-wide
> context.
>
> Training/development is more complicated, and happens offline - akin to
> how, today, attempts to improve an optimizing pass also happen offline. A
> description of the high level design and the specifics we used for the
> current scope are given in Appendix.
> Current Scope
>
> The goal of our first milestone was to evaluate end to end an integration
> of ML with LLVM, and get a first promising result. To that end, we chose
> inlining for size (-Oz) as a stepping stone, as we perceived it to be more
> likely to require a simpler evaluation setup than performance-oriented
> optimizations might. At this point, we only train whether a call site may
> be inlined or not, leaving the SCC traversal order as-is.
>
> We are proposing an initial change demonstrating the inference mechanism
> using a pre-trained model, as a build-time opt-in to llvm. The compiler
> components needed to perform training are also included in this first
> change. Subsequent changes would include more training-related components.
>
> At a high level, the changes we are proposing consist of:
>
>    1. a new module analysis pass, InliningAdvisor. By default, its
>    implementation does nothing.
>    2. minimal hooks into Inliner.cpp.
>    3. the implementation of InliningAdvisor, activated when we opt-in ML.
>    This is available in Analysis/ML, together with:
>    1. Rel/Dev specific ML model handing, also under Analysis/ML
>       2. a pre-trained model for inlining for size
>       (Analysis/ML/models/inlining)
>       3. a pre-trained model for predicting native size from IR
>       (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
>       4.
>
>    Some refactorings in PassBuilder, to allow opting into running
>    mandatory inlining first - some compilation speedup for the ML case,
>    minimal, noise-like size effect. Also simplifies testing (these would be
>    introduced as a preliminary patch).
>
> We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is more
> involved.
> Inference Opt-In Mechanism
>
> The feature is primarily controlled by the cmake flag
> LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The
> “Rel”ease case requires specifying the location of the pip tensorflow
> package (currently, that’s tf_nightly, and it should soon be available in
> tensorflow)
>
> To opt in the ‘Rel’ case:
>
>    1.
>
>    install tensorflow pip package
>
> pip3 install tf_nightly --user
>
>    1.
>
>    configure llvm build
>
> cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
> -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
> {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - it’s
> usually in the path>}
>
>
>
>    1. build llvm as usual.
>    2.
>
>    pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to
>    clang.
>
> Details
>
> The ML model is captured as a TensorFlow ‘saved model’. When building
> llvm, we use TensorFlow’s  XLA native compiler (saved_model_cli) to
> compile the saved model into a native static library and a header file.
> Insofar as LLVM is concerned, there are minimal additional runtime
> requirements, packaged as source within the pip package: C++ wrappers
> around the compiled model. These will also be statically linked in the LLVM
> target. The compiled code is otherwise just a complex arithmetical
> computation, with no special requirements - it is single threaded and runs
> natively on the targeted architecture. Together with the aforementioned
> runtime dependencies, it adds ~115KB to the clang binary (0.08% increase)
>
> Runtime-wise, we observed a ~10% increase in the time spent in the
> inliner, for a large (33MB) binary IR module; inlining typically consumes
> ~10-15% of total compilation time, so the overall compile time overhead of
> the approach is arguably negligible. This cost is almost in entirety
> attributable to feature extraction.
>
> Memory-wise, the precompiled model has a fixed size buffer for its inputs,
> and performs a fixed amount of computations, so the memory overhead
> inherent to our approach is independent from the program being optimized.
> Using a small example to avoid effects such as memory use differences due
> to different inlinings, we observed an 300KB increase in the maximum
> resident size.
>
> A table showing effect on -Oz compiled binaries’ size is given in Appendix.
> Next directions
>
> Our next milestone has two main high level goals: detailing a systematic
> approach to driving policy effectiveness; and exploring in depth the type
> of features and training algorithms most appropriate for compiler problems,
> or at least problems like inlining. For the latter, we expect embedding
> more of the call graph structure to play an important role, as well as,
> potentially, delegating the problem space traversal to the ML model.
>
> We plan to include inlining for speed as part of our work on these goals.
> AppendixTraining - High Level
>
> We use Reinforcement Learning (RL) to train the Inline-for-size model. At
> a high level, it is composed of 3 parts: training data collection, model
> training, and iterative data collection/model training. We use TensorFlow
> as our ML framework.
>
> Related, we also needed to learn a separate model to evaluate the native
> size of a function, given its IR, in order to calculate a more precise
> reward for the reinforcement learning algorithm (“IR2Native”). We evaluated
> ‘just counting IR’ and TargetTransformInfo, but they appeared to provide
> too noisy of a signal for the reward, insofar as the RL training algorithm
> for the inlining model was concerned. This model is only used during
> training.
>
> RL - Training data collection: the training data we need to feed into a
> reinforcement learning algorithm are sequences consisting of: state of the
> problem (i.e. features); action (inline/not inline), and reward (native
> size shrinkage after inline/not inline, using ir2native). To collect the
> sequences, we hook the logging infrastructure into LLVM Inliner that is
> able to produce logs after the inline optimization pass.
>
> RL - Model training: We use DQN (Deep Q-Network) to train our
> inlining-for-size ML policy. On a high level, the DQN algorithm trains a
> neural network to predict the value of different actions --- the DQN policy
> then chooses to take the action with the highest predicted value. In our
> scenario, we have two actions: 1) inline; 2) not inline, so the neural
> network predicts the size reduction of these two actions based on features,
> and then decides to conduct inlining if the neural network believes doing
> inlining leads to higher size reduction. After the training finishes, it
> produces a TensorFlow SavedModel that takes features as input and generates
> inline decisions (whether to inline or not) as output.
>
> The choice of the features and reward are essential in model training. The
> features are chosen with the consideration of being helpful in making the
> decision --- the input to the cost model is a good starting point. Ideally,
> the reward in the inline-for-size problem is the native size shrinkage
> after inline/not inline. It is difficult to obtain precisely, so we use the
> estimate as an alternative. This means that, for training, we also need a
> model (not necessarily machine learned) for estimating rewards.
>
> RL - Iterative data collection/model training: Reinforcement learning is
> ideally an iterative model/policy improvement process that: 1) the trained
> model is deployed to the field to collect new data; 2) newly collected data
> are used to update the model. Thus, we need to do iterative data
> collection/model training. To facilitate data collection (avoid complex
> build dependencies and time spent before/after the pass being trained), we
> isolate out IR modules captured right before the optimization we are
> interested in, and iterate on them with opt. We bootstrap the training from
> the current heuristic, and stop the process once we are happy with the
> outcome.
>
> IR2Native: We collect IR features (different from the ones used for
> inlining) for each function at the end of inlining, right before we perform
> function simplification passes, and right after. This means we have two IR
> ‘shapes’ of the same function, and we know no further inlinings will be
> performed, so whatever changes happen are based on that IR. We then extract
> the native size at the end of compilation. Together, this data forms two
> records per function that can be used in supervised learning - the features
> are those extracted from IR, and the label is the native size. Training
> IR2Native happens independently from the training of the inliner model.
> Training support for the current scope
>
> The initial change includes the logging mechanism, an ir2native model
> trained for x86-64, and the means to rapidly iterate over development ML
> models. For the components that will be included in subsequent changes, the
> rest of this section describes the mechanisms we employed. These components
> are detailed further below.
>
> To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow C
> API library <https://www.tensorflow.org/install/lang_c>. We then
> configure the build:
> cmake .. -DLLVM_USE_ML_POLICY=Dev \
> -DLLVM_TF_C_LIB=<path to unarchived package> \
> {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy tensorflow shared
> library over, if it’s not on LD_LIBRARY_PATH}
>
>
> To extract IR right before inlining, we hacked on top of the ThinLTO
> infrastructure, interrupting its pre-link pipeline right before inlining.
> This means clang produces binary IR files captured at that stage. We then
> built a large target, obtaining a corpus of ~25K modules. We intend to
> provide a clean mechanism in a subsequent change.
>
> To obtain features/labels for training this “IR to Native Size” model, we
> had to make some changes to the AsmPrinter (to get real native sizes) and
> llvm-readobj, as well as add an analysis pass for extracting the IR
> features for this model. We plan on upstreaming these changes subsequently.
>
> Finally, the infrastructure driving the policy training is currently built
> on proprietary APIs, as it benefits from a distributed computing
> infrastructure. We are currently evaluating options for open sourcing it.
> In the meantime, we are presenting the high level implementation details.
>
> To train a new model, the infrastructure performs 2 steps: extracting the
> logs, and using them in a training algorithm.
>
> Log extraction is highly parallelizable: for each IR module in the
> training corpus, we need to run opt once (or a few times, when we explore
> improvements). Specifically, each run is this invocation:
>
> opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to
> ir2native> -training-log=<path to training log output>
-enable-ml-inliner
> -mandatory-inlinings-first -o <output> <module.o>
>
> Then collect the logs, and pass them to the next step.
>
>
> Experimental results
>
> Experimental results are available as follows:
>
>
>    - SPEC2006
>   
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
>    binary sizes (-Oz) and ‘test run’ scores.
>    -
>
>    Size report
>   
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
>    from some internal benchmarks and binaries, including opt and clang
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200412/436fd5d0/attachment.html>

llvm dev - Apr 2020 - RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM