Mircea Trofin via llvm-dev
2020-Apr-08 21:23 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
It turns out it's me, sorry. Let me see how I can sort this out. In the meantime, here is the csv: SPEC2006 data: binary,base -Oz size,ML -Oz size,ML size shrink by,,perf: base -Oz scores,perf: ML -Oz scores,ML improvement by 400.perlbench,2054200,2086776,-1.59%,,2.9,2.9,0.00% 401.bzip2,1129976,1095544,3.05%,,6.4,6.2,-3.13% 403.gcc,4078488,4130840,-1.28%,,11.6,11.7,0.86% 429.mcf,1089368,1054552,3.20%,,2.3,2.4,4.35% 433.milc,1161336,1129592,2.73%,,0.5,0.5,0.00% 444.namd,1324824,1290968,2.56%,,4.4,4.3,-2.27% 445.gobmk,5003096,4992472,0.21%,,3.9,4.1,5.13% 447.dealII,1003376,975024,2.83%,,162.1,197.8,22.02% 450.soplex,1359416,1326008,2.46%,,0.1,0.1,0.00% 453.povray,1921432,1952280,-1.61%,,32.9,32.7,-0.61% 456.hmmer,1210744,1184632,2.16%,,1.5,1.5,0.00% 458.sjeng,1185976,1155320,2.58%,,812.9,840.2,3.36% 462.libquantum,1101144,1066712,3.13%,,12.3,12.2,-0.81% 464.h264ref,1557176,1593272,-2.32%,,0.3,0.3,0.00% 470.lbm,1091352,1056664,3.18%,,2.3,2.5,8.70% 471.omnetpp,1045496,1046664,-0.11%,,27.9,28.4,1.79% 473.astar,1106680,1071992,3.13%,,2.4,2.6,8.33% 482.sphinx3,1216600,1182680,2.79%,,16.5,16.3,-1.21% 483.xalancbmk,4666936,4669112,-0.05%,,9.7,9.6,-1.03% ,,,,,,, SIZE SUM,34307616,34061104,0.72%,TOTAL SCORES,5.7,5.8,1.75% various benchmarks: binary,base -Oz size,ML -Oz size,ML shrink over base llvm:opt,66318624,62124256,6.32% math_1,6181552,5884144,4.81% event_management_1,4998728,4802696,3.92% llvm:lcalsBRaw,1270168,1222168,3.78% llvm:lcalsBLambda,1270232,1222232,3.78% llvm:lcalsARaw,1276248,1228248,3.76% llvm:lcalsALambda,1276440,1228504,3.76% llvm:lcalsCRaw,1278936,1231000,3.75% llvm:lcalsCLambda,1279064,1231256,3.74% llvm:Blur,1236888,1191192,3.69% image_processing_diffusion,1236120,1190488,3.69% llvm:Interpolation,1237208,1191576,3.69% llvm:harris,1237208,1191768,3.67% llvm:Dither,1238232,1192792,3.67% event_management_2,4741096,4568584,3.64% hashing_1,5189360,5004176,3.57% compression_2,5329392,5140496,3.54% math_2,6072336,5858992,3.51% crypto_1,4721576,4556008,3.51% math_4,27720208,26755568,3.48% math_3,27732496,26767920,3.48% infra_1,30673232,29630464,3.40% hashing_2,5834544,5637168,3.38% image_processing_entropy,5854960,5657008,3.38% hashing_3,5831088,5634288,3.38% memory_management_1,4957960,4790632,3.37% hasing_4,5816048,5619888,3.37% data_storage_2,5852976,5655792,3.37% compression_1,5971984,5771120,3.36% arena_unittest,6672944,6453584,3.29% data_storage_3,44113232,42710960,3.18% data_storage_1,22858992,22132960,3.18% datastructures_1,6362032,6161264,3.16% infra_2,56828720,55053496,3.12% datastructures_2,164558704,160790320,2.29% llvm:clang,77027416,75672088,1.76% benchmark_infra,20221424,19944432,1.37% search_1,169475360,167645632,1.08% protobuf_3,24071504,23895440,0.73% protobuf_1,24071504,23895504,0.73% protobuf_2,24071504,23895504,0.73% protobuf_4,24071504,23895504,0.73% protobuf_5,24071504,23895504,0.73% protobuf_6,23982800,23809424,0.72% On Wed, Apr 8, 2020 at 2:11 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:> > Cool! > > I skipped to the end and tried to access the gdoc and the spreadsheet > but it did tell me I need permissions. Can you make them accessible or > am I the problem? > > Thanks, > Johannes > > On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote: > > TL;DR; We can improve compiler optimizations driven by heuristics by > > replacing those heuristics with machine-learned policies (ML models). > > Policies are trained offline and ship as part of the compiler. 
> Determinism > > is maintained because models are fixed when the compiler is operating in > > production. Fine-tuning or regressions may be handled by > incorporating the > > interesting cases in the ML training set, retraining the compiler, and > > redeploying it. > > > > For a first milestone, we chose inlining for size (-Oz) on X86-64. We > were > > able to train an ML model to produce binaries 1.5-6% smaller than -Oz of > > tip-of-tree. The trained model appears to generalize well over a diverse > > set of binaries. Compile time is increased marginally (under 5%). The > model > > also happens to produce slightly better - performing code under > SPEC2006 - > > total score improvement by 1.75%. As we only wanted to verify there is > no > > significant regression in SPEC, and given the milestone goals, we > haven’t > > dug any deeper into the speed results. > > > > We see these results as promising, and as a reasonable point for > > contributing our current work as a build-time opt-in to LLVM to > benefit the > > community, in the hope of fostering collaboration and learning from the > > community’s feedback, as we try to better understand the trade-offs > such an > > approach entails, and as we work on expanding the depth and breadth of > > applying these techniques to compiler optimization problems. > > > > https://reviews.llvm.org/D77752 > > Motivation > > > > Optimization problems, such as inlining or register allocation, are > hard: > > we don’t know of algorithms that are guaranteed to produce the optimum > > solution in all cases in a timely fashion. Instead, we use heuristics: > > algorithms that we expect to produce some approximation of the optimum, > > with some expected degree of generality, in a practical amount of time. > > > > Heuristics have some common characteristics. Taking inlining as a case > > study, it traverses the problem space in some way (bottom-up traversal > of > > the SCC graph), extracts some properties (let’s call them “features”) of > > the program being optimized, and combine them with some weights > (“tune”), > > to produce a cost (InlineCost), which allows for trade-off analysis. We > > validate the effectiveness of a heuristic over some accepted set of > > benchmarks. Over time, we react to regressions or pathological cases > > observed in the field, by manually analyzing such cases, figuring out an > > enhancement to the heuristic, and then re-validating over that set of > > benchmarks (maybe augmented by adding the newly found cases). > > > > Because heuristics are code that needs to be maintained, there is > pressure > > to reduce complexity: adding more features means we need to reason about > > the interactions between the old and new features, which scales > > combinatorially. Re-tuning because of the new features adds a similar > kind > > of complexity. Potentially, we miss out on optimization improvements as > a > > result. > > > > Because tuning is manual, there is pressure to keep the number of > > benchmarks that can be studied in depth to a humanly-manageable size, > which > > potentially affects the generality of a heuristic or heuristic tuning. > > > > The main advantage of manual heuristics is arguably their relatively > lower > > overhead: no additional dependencies and more transparent to human > analysis > > and improvement. 
> > > > Machine learning, in particular reinforcement learning, can address the > > tensions found in manual heuristics: once features are extracted from > the > > program, the way they are combined and tuned can easily be scaled up > > through automation, improving effectiveness and generality. A major > > drawback, at least at this point in time, of machine learning, is that > we > > don’t yet have a fully developed systematic approach for improving > policy > > effectiveness. > > High level design > > > > We identify two scenarios for a compiler using ML policies: > development and > > release. > > > > The release scenario is equivalent to the regular compilation we have > today > > - the only difference is that it uses a pre-trained model (trained in > the > > development scenario beforehand) to make decisions instead of the > > heuristics. Determinism is guaranteed since the model in the release > > scenario is fixed. We imagine teams wishing to fine tune the > effectiveness > > of the optimization to their scenarios would train a different model. > > > > The decision previously evaluated using a human-crafted heuristic is > > optionally replaced by: > > > > - > > > > a compiler-specific component, extracting features from IR (i.e. a > > vector of values) > > - > > > > an evaluation of an ML model using those features, to obtain a > result. > > In ML nomenclature, this is referred to using the model for > inference (as > > opposed to training it) > > > > For example, when we replaced the decision of whether to inline a > callsite, > > the ML model produces a boolean (inline/don’t inline) based on a > features > > vector characterizing the call site and some broader module-wide > context. > > > > Training/development is more complicated, and happens offline - akin to > > how, today, attempts to improve an optimizing pass also happen offline. > A > > description of the high level design and the specifics we used for the > > current scope are given in Appendix. > > Current Scope > > > > The goal of our first milestone was to evaluate end to end an > integration > > of ML with LLVM, and get a first promising result. To that end, we chose > > inlining for size (-Oz) as a stepping stone, as we perceived it to be > more > > likely to require a simpler evaluation setup than performance-oriented > > optimizations might. At this point, we only train whether a call site > may > > be inlined or not, leaving the SCC traversal order as-is. > > > > We are proposing an initial change demonstrating the inference mechanism > > using a pre-trained model, as a build-time opt-in to llvm. The compiler > > components needed to perform training are also included in this first > > change. Subsequent changes would include more training-related > components. > > > > At a high level, the changes we are proposing consist of: > > > > 1. > > > > a new module analysis pass, InliningAdvisor. By default, its > > implementation does nothing. > > 2. > > > > minimal hooks into Inliner.cpp. > > 3. > > > > the implementation of InliningAdvisor, activated when we opt-in > ML. This > > is available in Analysis/ML, together with: > > 1. > > > > Rel/Dev specific ML model handing, also under Analysis/ML > > 2. > > > > a pre-trained model for inlining for size > > (Analysis/ML/models/inlining) > > 3. > > > > a pre-trained model for predicting native size from IR > > (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only. > > 4. 
> > > > Some refactorings in PassBuilder, to allow opting into running > mandatory > > inlining first - some compilation speedup for the ML case, minimal, > > noise-like size effect. Also simplifies testing (these would be > introduced > > as a preliminary patch). > > > > We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is > more > > involved. > > Inference Opt-In Mechanism > > > > The feature is primarily controlled by the cmake flag > > LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The > > “Rel”ease case requires specifying the location of the pip tensorflow > > package (currently, that’s tf_nightly, and it should soon be available > in > > tensorflow) > > > > To opt in the ‘Rel’ case: > > > > 1. > > > > install tensorflow pip package > > > > pip3 install tf_nightly --user > > > > 1. > > > > configure llvm build > > > > cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \ > > > > -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \ > > > > {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - it’s > > usually in the path>} > > > > > > > > 1. > > > > build llvm as usual. > > 2. > > > > pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to > > clang. > > > > Details > > > > The ML model is captured as a TensorFlow ‘saved model’. When building > llvm, > > we use TensorFlow’s XLA native compiler (saved_model_cli) to compile > the > > saved model into a native static library and a header file. Insofar > as LLVM > > is concerned, there are minimal additional runtime requirements, > packaged > > as source within the pip package: C++ wrappers around the compiled > model. > > These will also be statically linked in the LLVM target. The compiled > code > > is otherwise just a complex arithmetical computation, with no special > > requirements - it is single threaded and runs natively on the targeted > > architecture. Together with the aforementioned runtime dependencies, it > > adds ~115KB to the clang binary (0.08% increase) > > > > Runtime-wise, we observed a ~10% increase in the time spent in the > inliner, > > for a large (33MB) binary IR module; inlining typically consumes > ~10-15% of > > total compilation time, so the overall compile time overhead of the > > approach is arguably negligible. This cost is almost in entirety > > attributable to feature extraction. > > > > Memory-wise, the precompiled model has a fixed size buffer for its > inputs, > > and performs a fixed amount of computations, so the memory overhead > > inherent to our approach is independent from the program being > optimized. > > Using a small example to avoid effects such as memory use differences > due > > to different inlinings, we observed an 300KB increase in the maximum > > resident size. > > > > A table showing effect on -Oz compiled binaries’ size is given in > Appendix. > > Next directions > > > > Our next milestone has two main high level goals: detailing a systematic > > approach to driving policy effectiveness; and exploring in depth the > type > > of features and training algorithms most appropriate for compiler > problems, > > or at least problems like inlining. For the latter, we expect embedding > > more of the call graph structure to play an important role, as well as, > > potentially, delegating the problem space traversal to the ML model. > > > > We plan to include inlining for speed as part of our work on these > goals. > > AppendixTraining - High Level > > > > We use Reinforcement Learning (RL) to train the Inline-for-size > model. 
At a > > high level, it is composed of 3 parts: training data collection, model > > training, and iterative data collection/model training. We use > TensorFlow > > as our ML framework. > > > > Related, we also needed to learn a separate model to evaluate the native > > size of a function, given its IR, in order to calculate a more precise > > reward for the reinforcement learning algorithm (“IR2Native”). We > evaluated > > ‘just counting IR’ and TargetTransformInfo, but they appeared to provide > > too noisy of a signal for the reward, insofar as the RL training > algorithm > > for the inlining model was concerned. This model is only used during > > training. > > > > RL - Training data collection: the training data we need to feed into a > > reinforcement learning algorithm are sequences consisting of: state > of the > > problem (i.e. features); action (inline/not inline), and reward (native > > size shrinkage after inline/not inline, using ir2native). To collect the > > sequences, we hook the logging infrastructure into LLVM Inliner that is > > able to produce logs after the inline optimization pass. > > > > RL - Model training: We use DQN (Deep Q-Network) to train our > > inlining-for-size ML policy. On a high level, the DQN algorithm trains a > > neural network to predict the value of different actions --- the DQN > policy > > then chooses to take the action with the highest predicted value. In our > > scenario, we have two actions: 1) inline; 2) not inline, so the neural > > network predicts the size reduction of these two actions based on > features, > > and then decides to conduct inlining if the neural network believes > doing > > inlining leads to higher size reduction. After the training finishes, it > > produces a TensorFlow SavedModel that takes features as input and > generates > > inline decisions (whether to inline or not) as output. > > > > The choice of the features and reward are essential in model > training. The > > features are chosen with the consideration of being helpful in making > the > > decision --- the input to the cost model is a good starting point. > Ideally, > > the reward in the inline-for-size problem is the native size shrinkage > > after inline/not inline. It is difficult to obtain precisely, so we > use the > > estimate as an alternative. This means that, for training, we also need > a > > model (not necessarily machine learned) for estimating rewards. > > > > RL - Iterative data collection/model training: Reinforcement learning is > > ideally an iterative model/policy improvement process that: 1) the > trained > > model is deployed to the field to collect new data; 2) newly > collected data > > are used to update the model. Thus, we need to do iterative data > > collection/model training. To facilitate data collection (avoid complex > > build dependencies and time spent before/after the pass being > trained), we > > isolate out IR modules captured right before the optimization we are > > interested in, and iterate on them with opt. We bootstrap the > training from > > the current heuristic, and stop the process once we are happy with the > > outcome. > > > > IR2Native: We collect IR features (different from the ones used for > > inlining) for each function at the end of inlining, right before we > perform > > function simplification passes, and right after. This means we have > two IR > > ‘shapes’ of the same function, and we know no further inlinings will be > > performed, so whatever changes happen are based on that IR. 
We then > extract > > the native size at the end of compilation. Together, this data forms two > > records per function that can be used in supervised learning - the > features > > are those extracted from IR, and the label is the native size. Training > > IR2Native happens independently from the training of the inliner model. > > Training support for the current scope > > > > The initial change includes the logging mechanism, an ir2native model > > trained for x86-64, and the means to rapidly iterate over development ML > > models. For the components that will be included in subsequent > changes, the > > rest of this section describes the mechanisms we employed. These > components > > are detailed further below. > > > > To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow C > > API library <https://www.tensorflow.org/install/lang_c>. We then > configure > > the build: > > > > cmake .. -DLLVM_USE_ML_POLICY=Dev \ > > > > -DLLVM_TF_C_LIB=<path to unarchived package> \ > > > > {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy tensorflow shared > > library over, if it’s not on LD_LIBRARY_PATH} > > > > > > To extract IR right before inlining, we hacked on top of the ThinLTO > > infrastructure, interrupting its pre-link pipeline right before > inlining. > > This means clang produces binary IR files captured at that stage. We > then > > built a large target, obtaining a corpus of ~25K modules. We intend to > > provide a clean mechanism in a subsequent change. > > > > To obtain features/labels for training this “IR to Native Size” model, > we > > had to make some changes to the AsmPrinter (to get real native sizes) > and > > llvm-readobj, as well as add an analysis pass for extracting the IR > > features for this model. We plan on upstreaming these changes > subsequently. > > > > Finally, the infrastructure driving the policy training is currently > built > > on proprietary APIs, as it benefits from a distributed computing > > infrastructure. We are currently evaluating options for open sourcing > it. > > In the meantime, we are presenting the high level implementation > details. > > > > To train a new model, the infrastructure performs 2 steps: extracting > the > > logs, and using them in a training algorithm. > > > > Log extraction is highly parallelizable: for each IR module in the > training > > corpus, we need to run opt once (or a few times, when we explore > > improvements). Specifically, each run is this invocation: > > > > opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to > > ir2native> -training-log=<path to training log output> > -enable-ml-inliner > > -mandatory-inlinings-first -o <output> <module.o> > > > > Then collect the logs, and pass them to the next step. > > > > > > Experimental results > > > > Experimental results are available as follows: > > > > > > - > > > > SPEC2006 > > > < > https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true > > > > binary sizes (-Oz) and ‘test run’ scores. 
> > - > > Size report > > <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true> > > from some internal benchmarks and binaries, including opt and clang
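To make the contrast described in the RFC above concrete - a hand-written cost heuristic versus a learned policy answering the same question - here is a minimal sketch. The feature names, weights, and threshold are illustrative assumptions, not LLVM's actual InlineCost computation, and the model object stands in for the AOT-compiled TensorFlow SavedModel:

# Sketch: a hand-tuned heuristic vs. a learned policy making the same
# inline/don't-inline decision. Feature names, weights, and the threshold
# are made up for illustration; this is not LLVM's InlineCost logic.

FEATURES = ["callee_basic_block_count", "callsite_height", "node_count"]

def heuristic_should_inline(features):
    # Combine features with hand-tuned weights to produce a cost, then
    # compare against a threshold - the shape of a manual heuristic.
    weights = {"callee_basic_block_count": 5.0,
               "callsite_height": 1.5,
               "node_count": 0.1}
    cost = sum(weights[name] * features[name] for name in FEATURES)
    return cost <= 45.0

def ml_should_inline(features, model):
    # The learned policy consumes the same kind of feature vector and
    # returns a boolean; `model` stands in for the compiled SavedModel.
    feature_vector = [features[name] for name in FEATURES]
    return bool(model.predict(feature_vector))

In the 'Rel' build only the second path runs: feature extraction is C++ inside the compiler, and the model evaluation is the XLA-compiled SavedModel statically linked into LLVM, so no Python is involved at compile time.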
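The DQN training loop sketched in the appendix - predict the value of both actions, act greedily, and update toward reward plus discounted future value - can be illustrated in a few lines. This is only a sketch: it uses a linear Q-function in place of the neural network, hand-picked learning rate and discount, and a synthetic transition instead of a record from the compiler's training logs:

import numpy as np

NUM_ACTIONS, NUM_FEATURES = 2, 8   # actions: 0 = don't inline, 1 = inline
rng = np.random.default_rng(0)

# Linear Q-function standing in for the DQN's neural network:
# Q(state, action) = W[action] . state
W = rng.normal(scale=0.01, size=(NUM_ACTIONS, NUM_FEATURES))

def q_values(state):
    # Predicted value (e.g. estimated size reduction) of each action.
    return W @ state

def choose_action(state, epsilon=0.1):
    # Epsilon-greedy: mostly pick the action with the highest predicted value.
    if rng.random() < epsilon:
        return int(rng.integers(NUM_ACTIONS))
    return int(np.argmax(q_values(state)))

def q_update(state, action, reward, next_state, alpha=0.01, gamma=0.99):
    # One TD update for a logged (state, action, reward, next_state) tuple;
    # in the RFC's setup the reward comes from the IR2Native size estimate.
    target = reward + gamma * np.max(q_values(next_state))
    td_error = target - q_values(state)[action]
    W[action] += alpha * td_error * state

# One synthetic transition, standing in for a record from the training logs.
s, s_next = rng.normal(size=NUM_FEATURES), rng.normal(size=NUM_FEATURES)
a = choose_action(s)
q_update(s, a, reward=12.0, next_state=s_next)

The actual policy is trained offline with TensorFlow; only the resulting SavedModel ships with the compiler.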
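Because log extraction is described as embarrassingly parallel (one opt invocation per module of the ~25K-module corpus), the driver can be as simple as the sketch below. The opt flags are the ones quoted in the RFC; the corpus layout, output paths, and the driver itself are assumptions for illustration:

import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

CORPUS_DIR = Path("corpus")        # IR modules captured before inlining (assumed layout)
LOG_DIR = Path("training-logs")
IR2NATIVE = "ir_2_native_x86_64"   # placeholder path to the ir2native model

def extract_log(module):
    # One opt run per module, using the invocation quoted in the RFC; the
    # rewritten module itself is not needed here, only the training log.
    log_path = LOG_DIR / (module.stem + ".log")
    subprocess.run(
        ["opt", "-passes=scc-oz-module-inliner",
         f"-ml-inliner-ir2native-model={IR2NATIVE}",
         f"-training-log={log_path}",
         "-enable-ml-inliner", "-mandatory-inlinings-first",
         "-o", "/dev/null", str(module)],
        check=True)
    return log_path

if __name__ == "__main__":
    LOG_DIR.mkdir(exist_ok=True)
    modules = sorted(CORPUS_DIR.glob("*.o"))
    with ProcessPoolExecutor() as pool:
        logs = list(pool.map(extract_log, modules))
    print(f"collected {len(logs)} training logs")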
Mircea Trofin via llvm-dev
2020-Apr-08 21:34 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
Unfortunately I needed to update the links. Here they are, hopefully these work correctly.

SPEC2006 <https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=987260531&single=true>

Internal benchmarks, clang, opt <https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=0&single=true>

Thanks!

On Wed, Apr 8, 2020 at 2:23 PM Mircea Trofin <mtrofin at google.com> wrote:

> It turns out it's me, sorry. Let me see how I can sort this out. In the meantime, here is the csv: [...]
Mircea Trofin via llvm-dev
2020-Apr-09 02:05 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
+Yundi Qian <yundi at google.com> +Eugene Brevdo <ebrevdo at google.com> , our team members from the ML side. To avoid formatting issues, here is a link to the RFC <https://docs.google.com/document/d/1BoSGQlmgAh-yUZMn4sCDoWuY6KWed2tV58P4_472mDE/edit?usp=sharing>, open to comments. Thanks! On Wed, Apr 8, 2020 at 2:34 PM Mircea Trofin <mtrofin at google.com> wrote:> Unfortunately I needed to update the links. Here they are, hopefully these > work correctly. > > SPEC2006 > <https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=987260531&single=true> > > Internal benchmarks, clang, opt > <https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=0&single=true> > > Thanks! > > On Wed, Apr 8, 2020 at 2:23 PM Mircea Trofin <mtrofin at google.com> wrote: > >> It turns out it's me, sorry. Let me see how I can sort this out. In the >> meantime, here is the csv: >> >> SPEC2006 data: >> >> binary,base -Oz size,ML -Oz size,ML size shrink by,,perf: base -Oz >> scores,perf: ML -Oz scores,ML improvement by >> 400.perlbench,2054200,2086776,-1.59%,,2.9,2.9,0.00% >> 401.bzip2,1129976,1095544,3.05%,,6.4,6.2,-3.13% >> 403.gcc,4078488,4130840,-1.28%,,11.6,11.7,0.86% >> 429.mcf,1089368,1054552,3.20%,,2.3,2.4,4.35% >> 433.milc,1161336,1129592,2.73%,,0.5,0.5,0.00% >> 444.namd,1324824,1290968,2.56%,,4.4,4.3,-2.27% >> 445.gobmk,5003096,4992472,0.21%,,3.9,4.1,5.13% >> 447.dealII,1003376,975024,2.83%,,162.1,197.8,22.02% >> 450.soplex,1359416,1326008,2.46%,,0.1,0.1,0.00% >> 453.povray,1921432,1952280,-1.61%,,32.9,32.7,-0.61% >> 456.hmmer,1210744,1184632,2.16%,,1.5,1.5,0.00% >> 458.sjeng,1185976,1155320,2.58%,,812.9,840.2,3.36% >> 462.libquantum,1101144,1066712,3.13%,,12.3,12.2,-0.81% >> 464.h264ref,1557176,1593272,-2.32%,,0.3,0.3,0.00% >> 470.lbm,1091352,1056664,3.18%,,2.3,2.5,8.70% >> 471.omnetpp,1045496,1046664,-0.11%,,27.9,28.4,1.79% >> 473.astar,1106680,1071992,3.13%,,2.4,2.6,8.33% >> 482.sphinx3,1216600,1182680,2.79%,,16.5,16.3,-1.21% >> 483.xalancbmk,4666936,4669112,-0.05%,,9.7,9.6,-1.03% >> ,,,,,,, >> SIZE SUM,34307616,34061104,0.72%,TOTAL SCORES,5.7,5.8,1.75% >> >> >> various benchmarks: >> >> binary,base -Oz size,ML -Oz size,ML shrink over base >> llvm:opt,66318624,62124256,6.32% >> math_1,6181552,5884144,4.81% >> event_management_1,4998728,4802696,3.92% >> llvm:lcalsBRaw,1270168,1222168,3.78% >> llvm:lcalsBLambda,1270232,1222232,3.78% >> llvm:lcalsARaw,1276248,1228248,3.76% >> llvm:lcalsALambda,1276440,1228504,3.76% >> llvm:lcalsCRaw,1278936,1231000,3.75% >> llvm:lcalsCLambda,1279064,1231256,3.74% >> llvm:Blur,1236888,1191192,3.69% >> image_processing_diffusion,1236120,1190488,3.69% >> llvm:Interpolation,1237208,1191576,3.69% >> llvm:harris,1237208,1191768,3.67% >> llvm:Dither,1238232,1192792,3.67% >> event_management_2,4741096,4568584,3.64% >> hashing_1,5189360,5004176,3.57% >> compression_2,5329392,5140496,3.54% >> math_2,6072336,5858992,3.51% >> crypto_1,4721576,4556008,3.51% >> math_4,27720208,26755568,3.48% >> math_3,27732496,26767920,3.48% >> infra_1,30673232,29630464,3.40% >> hashing_2,5834544,5637168,3.38% >> image_processing_entropy,5854960,5657008,3.38% >> hashing_3,5831088,5634288,3.38% >> memory_management_1,4957960,4790632,3.37% >> hasing_4,5816048,5619888,3.37% >> data_storage_2,5852976,5655792,3.37% >> compression_1,5971984,5771120,3.36% >> arena_unittest,6672944,6453584,3.29% >> data_storage_3,44113232,42710960,3.18% >> 
data_storage_1,22858992,22132960,3.18% >> datastructures_1,6362032,6161264,3.16% >> infra_2,56828720,55053496,3.12% >> datastructures_2,164558704,160790320,2.29% >> llvm:clang,77027416,75672088,1.76% >> benchmark_infra,20221424,19944432,1.37% >> search_1,169475360,167645632,1.08% >> protobuf_3,24071504,23895440,0.73% >> protobuf_1,24071504,23895504,0.73% >> protobuf_2,24071504,23895504,0.73% >> protobuf_4,24071504,23895504,0.73% >> protobuf_5,24071504,23895504,0.73% >> protobuf_6,23982800,23809424,0.72% >> >> On Wed, Apr 8, 2020 at 2:11 PM Johannes Doerfert < >> johannesdoerfert at gmail.com> wrote: >> >>> >>> Cool! >>> >>> I skipped to the end and tried to access the gdoc and the spreadsheet >>> but it did tell me I need permissions. Can you make them accessible or >>> am I the problem? >>> >>> Thanks, >>> Johannes >>> >>> On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote: >>> > TL;DR; We can improve compiler optimizations driven by heuristics by >>> > replacing those heuristics with machine-learned policies (ML models). >>> > Policies are trained offline and ship as part of the compiler. >>> Determinism >>> > is maintained because models are fixed when the compiler is operating >>> in >>> > production. Fine-tuning or regressions may be handled by >>> incorporating the >>> > interesting cases in the ML training set, retraining the compiler, and >>> > redeploying it. >>> > >>> > For a first milestone, we chose inlining for size (-Oz) on X86-64. We >>> were >>> > able to train an ML model to produce binaries 1.5-6% smaller than -Oz >>> of >>> > tip-of-tree. The trained model appears to generalize well over a >>> diverse >>> > set of binaries. Compile time is increased marginally (under 5%). The >>> model >>> > also happens to produce slightly better - performing code under >>> SPEC2006 - >>> > total score improvement by 1.75%. As we only wanted to verify there >>> is no >>> > significant regression in SPEC, and given the milestone goals, we >>> haven’t >>> > dug any deeper into the speed results. >>> > >>> > We see these results as promising, and as a reasonable point for >>> > contributing our current work as a build-time opt-in to LLVM to >>> benefit the >>> > community, in the hope of fostering collaboration and learning from >>> the >>> > community’s feedback, as we try to better understand the trade-offs >>> such an >>> > approach entails, and as we work on expanding the depth and breadth of >>> > applying these techniques to compiler optimization problems. >>> > >>> > https://reviews.llvm.org/D77752 >>> > Motivation >>> > >>> > Optimization problems, such as inlining or register allocation, are >>> hard: >>> > we don’t know of algorithms that are guaranteed to produce the optimum >>> > solution in all cases in a timely fashion. Instead, we use heuristics: >>> > algorithms that we expect to produce some approximation of the >>> optimum, >>> > with some expected degree of generality, in a practical amount of >>> time. >>> > >>> > Heuristics have some common characteristics. Taking inlining as a case >>> > study, it traverses the problem space in some way (bottom-up >>> traversal of >>> > the SCC graph), extracts some properties (let’s call them “features”) >>> of >>> > the program being optimized, and combine them with some weights >>> (“tune”), >>> > to produce a cost (InlineCost), which allows for trade-off analysis. >>> We >>> > validate the effectiveness of a heuristic over some accepted set of >>> > benchmarks. 
Over time, we react to regressions or pathological cases >>> > observed in the field, by manually analyzing such cases, figuring out >>> an >>> > enhancement to the heuristic, and then re-validating over that set of >>> > benchmarks (maybe augmented by adding the newly found cases). >>> > >>> > Because heuristics are code that needs to be maintained, there is >>> pressure >>> > to reduce complexity: adding more features means we need to reason >>> about >>> > the interactions between the old and new features, which scales >>> > combinatorially. Re-tuning because of the new features adds a similar >>> kind >>> > of complexity. Potentially, we miss out on optimization improvements >>> as a >>> > result. >>> > >>> > Because tuning is manual, there is pressure to keep the number of >>> > benchmarks that can be studied in depth to a humanly-manageable size, >>> which >>> > potentially affects the generality of a heuristic or heuristic tuning. >>> > >>> > The main advantage of manual heuristics is arguably their relatively >>> lower >>> > overhead: no additional dependencies and more transparent to human >>> analysis >>> > and improvement. >>> > >>> > Machine learning, in particular reinforcement learning, can address >>> the >>> > tensions found in manual heuristics: once features are extracted from >>> the >>> > program, the way they are combined and tuned can easily be scaled up >>> > through automation, improving effectiveness and generality. A major >>> > drawback, at least at this point in time, of machine learning, is >>> that we >>> > don’t yet have a fully developed systematic approach for improving >>> policy >>> > effectiveness. >>> > High level design >>> > >>> > We identify two scenarios for a compiler using ML policies: >>> development and >>> > release. >>> > >>> > The release scenario is equivalent to the regular compilation we have >>> today >>> > - the only difference is that it uses a pre-trained model (trained in >>> the >>> > development scenario beforehand) to make decisions instead of the >>> > heuristics. Determinism is guaranteed since the model in the release >>> > scenario is fixed. We imagine teams wishing to fine tune the >>> effectiveness >>> > of the optimization to their scenarios would train a different model. >>> > >>> > The decision previously evaluated using a human-crafted heuristic is >>> > optionally replaced by: >>> > >>> > - >>> > >>> > a compiler-specific component, extracting features from IR (i.e. a >>> > vector of values) >>> > - >>> > >>> > an evaluation of an ML model using those features, to obtain a >>> result. >>> > In ML nomenclature, this is referred to using the model for >>> inference (as >>> > opposed to training it) >>> > >>> > For example, when we replaced the decision of whether to inline a >>> callsite, >>> > the ML model produces a boolean (inline/don’t inline) based on a >>> features >>> > vector characterizing the call site and some broader module-wide >>> context. >>> > >>> > Training/development is more complicated, and happens offline - akin >>> to >>> > how, today, attempts to improve an optimizing pass also happen >>> offline. A >>> > description of the high level design and the specifics we used for the >>> > current scope are given in Appendix. >>> > Current Scope >>> > >>> > The goal of our first milestone was to evaluate end to end an >>> integration >>> > of ML with LLVM, and get a first promising result. 
To that end, we >>> chose >>> > inlining for size (-Oz) as a stepping stone, as we perceived it to be >>> more >>> > likely to require a simpler evaluation setup than performance-oriented >>> > optimizations might. At this point, we only train whether a call site >>> may >>> > be inlined or not, leaving the SCC traversal order as-is. >>> > >>> > We are proposing an initial change demonstrating the inference >>> mechanism >>> > using a pre-trained model, as a build-time opt-in to llvm. The >>> compiler >>> > components needed to perform training are also included in this first >>> > change. Subsequent changes would include more training-related >>> components. >>> > >>> > At a high level, the changes we are proposing consist of: >>> > >>> > 1. >>> > >>> > a new module analysis pass, InliningAdvisor. By default, its >>> > implementation does nothing. >>> > 2. >>> > >>> > minimal hooks into Inliner.cpp. >>> > 3. >>> > >>> > the implementation of InliningAdvisor, activated when we opt-in >>> ML. This >>> > is available in Analysis/ML, together with: >>> > 1. >>> > >>> > Rel/Dev specific ML model handing, also under Analysis/ML >>> > 2. >>> > >>> > a pre-trained model for inlining for size >>> > (Analysis/ML/models/inlining) >>> > 3. >>> > >>> > a pre-trained model for predicting native size from IR >>> > (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only. >>> > 4. >>> > >>> > Some refactorings in PassBuilder, to allow opting into running >>> mandatory >>> > inlining first - some compilation speedup for the ML case, minimal, >>> > noise-like size effect. Also simplifies testing (these would be >>> introduced >>> > as a preliminary patch). >>> > >>> > We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is >>> more >>> > involved. >>> > Inference Opt-In Mechanism >>> > >>> > The feature is primarily controlled by the cmake flag >>> > LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The >>> > “Rel”ease case requires specifying the location of the pip tensorflow >>> > package (currently, that’s tf_nightly, and it should soon be >>> available in >>> > tensorflow) >>> > >>> > To opt in the ‘Rel’ case: >>> > >>> > 1. >>> > >>> > install tensorflow pip package >>> > >>> > pip3 install tf_nightly --user >>> > >>> > 1. >>> > >>> > configure llvm build >>> > >>> > cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \ >>> > >>> > -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow >>> \ >>> > >>> > {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - >>> it’s >>> > usually in the path>} >>> > >>> > >>> > >>> > 1. >>> > >>> > build llvm as usual. >>> > 2. >>> > >>> > pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to >>> > clang. >>> > >>> > Details >>> > >>> > The ML model is captured as a TensorFlow ‘saved model’. When building >>> llvm, >>> > we use TensorFlow’s XLA native compiler (saved_model_cli) to compile >>> the >>> > saved model into a native static library and a header file. Insofar >>> as LLVM >>> > is concerned, there are minimal additional runtime requirements, >>> packaged >>> > as source within the pip package: C++ wrappers around the compiled >>> model. >>> > These will also be statically linked in the LLVM target. The compiled >>> code >>> > is otherwise just a complex arithmetical computation, with no special >>> > requirements - it is single threaded and runs natively on the targeted >>> > architecture. 
Together with the aforementioned runtime dependencies, the compiled model adds ~115KB to the clang binary (a 0.08% increase).

Runtime-wise, we observed a ~10% increase in the time spent in the inliner, for a large (33MB) binary IR module; inlining typically consumes ~10-15% of total compilation time, so the overall compile-time overhead of the approach is arguably negligible. This cost is almost entirely attributable to feature extraction.

Memory-wise, the precompiled model has a fixed-size buffer for its inputs and performs a fixed amount of computation, so the memory overhead inherent to our approach is independent of the program being optimized. Using a small example, to avoid effects such as memory-use differences due to different inlinings, we observed a 300KB increase in the maximum resident size.

A table showing the effect on the size of -Oz-compiled binaries is given in the Appendix.

Next directions

Our next milestone has two main high-level goals: detailing a systematic approach to driving policy effectiveness, and exploring in depth the type of features and training algorithms most appropriate for compiler problems, or at least for problems like inlining. For the latter, we expect embedding more of the call graph structure to play an important role, as well as, potentially, delegating the problem-space traversal to the ML model.

We plan to include inlining for speed as part of our work on these goals.

Appendix

Training - High Level

We use Reinforcement Learning (RL) to train the inline-for-size model. At a high level, it is composed of 3 parts: training data collection, model training, and iterative data collection/model training. We use TensorFlow as our ML framework.

Relatedly, we also needed to learn a separate model that evaluates the native size of a function, given its IR, in order to calculate a more precise reward for the reinforcement learning algorithm ("IR2Native"). We evaluated 'just counting IR' and TargetTransformInfo, but they appeared to provide too noisy a signal for the reward, insofar as the RL training algorithm for the inlining model was concerned. This model is only used during training.

RL - Training data collection: the training data we need to feed into a reinforcement learning algorithm are sequences consisting of: the state of the problem (i.e. features); the action (inline / not inline); and the reward (native size shrinkage after inline / not inline, estimated via IR2Native). To collect these sequences, we hook a logging infrastructure into the LLVM inliner, which produces logs after the inline optimization pass.

RL - Model training: we use DQN (Deep Q-Network) to train our inlining-for-size ML policy. At a high level, the DQN algorithm trains a neural network to predict the value of different actions - the DQN policy then takes the action with the highest predicted value. In our scenario, we have two actions: 1) inline; 2) don't inline. The neural network predicts the size reduction of each action based on the features, and the policy inlines if the network predicts that inlining yields the larger size reduction.
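For reference, and assuming the standard DQN formulation (the exact reward shaping, discount, and target-network details are not spelled out here), the greedy action choice and the training target are:

  a^* = \operatorname{argmax}_{a \in \{\text{inline},\ \text{don't inline}\}} Q(s, a; \theta)
  y   = r + \gamma \max_{a'} Q(s', a'; \theta^-)

where r is the estimated native size shrinkage following the action (via IR2Native), s' is the state at the next inlining decision, and \theta^- are the target network's parameters.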
After training finishes, it produces a TensorFlow SavedModel that takes features as input and generates inline decisions (whether to inline or not) as output.

The choice of features and reward is essential to model training. The features are chosen for being helpful in making the decision - the input to the existing cost model is a good starting point. Ideally, the reward in the inline-for-size problem is the native size shrinkage after inline / not inline. It is difficult to obtain precisely, so we use an estimate instead. This means that, for training, we also need a model (not necessarily machine-learned) for estimating rewards.

RL - Iterative data collection/model training: reinforcement learning is ideally an iterative model/policy improvement process in which: 1) the trained model is deployed to the field to collect new data; 2) the newly collected data are used to update the model. Thus, we need to do iterative data collection/model training. To facilitate data collection (avoiding complex build dependencies and time spent before/after the pass being trained), we isolate the IR modules captured right before the optimization we are interested in, and iterate on them with opt. We bootstrap the training from the current heuristic, and stop the process once we are happy with the outcome.

IR2Native: we collect IR features (different from the ones used for inlining) for each function at the end of inlining, right before we perform the function simplification passes, and again right after. This means we have two IR 'shapes' of the same function, and we know no further inlinings will be performed, so whatever changes happen next are based on that IR. We then extract the native size at the end of compilation. Together, this data forms two records per function that can be used in supervised learning: the features are those extracted from the IR, and the label is the native size. Training IR2Native happens independently from the training of the inliner model.
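As an illustrative C++ shape for these per-function records (the names and layout are hypothetical, not the actual logging format):

  #include <cstdint>
  #include <vector>

  // One supervised-learning record: IR features for a function at a given
  // point in the pipeline, labeled with the function's final native size.
  struct IR2NativeRecord {
    std::vector<int64_t> IRFeatures; // extracted from one IR 'shape'
    uint64_t NativeSizeBytes;        // label: measured at the end of compilation
  };

  // Each function contributes two records: one from the IR right after
  // inlining (before function simplification), and one right after
  // simplification.
  struct FunctionSamples {
    IR2NativeRecord BeforeSimplification;
    IR2NativeRecord AfterSimplification;
  };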
Training support for the current scope

The initial change includes the logging mechanism, an ir2native model trained for x86-64, and the means to rapidly iterate over development ML models. For the components that will be included in subsequent changes, the rest of this section describes the mechanisms we employed.

To build LLVM with the ML policy in 'Dev' mode, we need the tensorflow C API library <https://www.tensorflow.org/install/lang_c>. We then configure the build:

  cmake .. -DLLVM_USE_ML_POLICY=Dev \
    -DLLVM_TF_C_LIB=<path to unarchived package> \
    {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy the tensorflow shared library over, if it's not on LD_LIBRARY_PATH}

To extract IR right before inlining, we hacked on top of the ThinLTO infrastructure, interrupting its pre-link pipeline right before inlining. This means clang produces binary IR files captured at that stage. We then built a large target, obtaining a corpus of ~25K modules. We intend to provide a clean mechanism in a subsequent change.

To obtain features/labels for training this "IR to Native Size" model, we had to make some changes to the AsmPrinter (to get real native sizes) and to llvm-readobj, as well as add an analysis pass for extracting the IR features for this model. We plan on upstreaming these changes subsequently.

Finally, the infrastructure driving the policy training is currently built on proprietary APIs, as it benefits from a distributed computing infrastructure. We are currently evaluating options for open-sourcing it. In the meantime, we are presenting the high-level implementation details.

To train a new model, the infrastructure performs 2 steps: extracting the logs, and using them in a training algorithm.

Log extraction is highly parallelizable: for each IR module in the training corpus, we need to run opt once (or a few times, when we explore improvements). Specifically, each run is this invocation:

  opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to ir2native> -training-log=<path to training log output> -enable-ml-inliner -mandatory-inlinings-first -o <output> <module.o>

We then collect the logs and pass them to the next step.

Experimental results

Experimental results are available as follows:

  - SPEC2006 <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true> binary sizes (-Oz) and 'test run' scores.
  - Size report <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true> from some internal benchmarks and binaries, including opt and clang.

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev