Mircea Trofin via llvm-dev
2020-Apr-09 02:05 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
+Yundi Qian <yundi at google.com> +Eugene Brevdo <ebrevdo at google.com>, our team members from the ML side.

To avoid formatting issues, here is a link to the RFC <https://docs.google.com/document/d/1BoSGQlmgAh-yUZMn4sCDoWuY6KWed2tV58P4_472mDE/edit?usp=sharing>, open to comments.

Thanks!

On Wed, Apr 8, 2020 at 2:34 PM Mircea Trofin <mtrofin at google.com> wrote:

> Unfortunately I needed to update the links. Here they are; hopefully these work correctly.
>
> SPEC2006
> <https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=987260531&single=true>
>
> Internal benchmarks, clang, opt
> <https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=0&single=true>
>
> Thanks!
>
> On Wed, Apr 8, 2020 at 2:23 PM Mircea Trofin <mtrofin at google.com> wrote:
>
>> It turns out it's me, sorry. Let me see how I can sort this out. In the meantime, here is the csv:
>>
>> SPEC2006 data:
>>
>> binary,base -Oz size,ML -Oz size,ML size shrink by,,perf: base -Oz scores,perf: ML -Oz scores,ML improvement by
>> 400.perlbench,2054200,2086776,-1.59%,,2.9,2.9,0.00%
>> 401.bzip2,1129976,1095544,3.05%,,6.4,6.2,-3.13%
>> 403.gcc,4078488,4130840,-1.28%,,11.6,11.7,0.86%
>> 429.mcf,1089368,1054552,3.20%,,2.3,2.4,4.35%
>> 433.milc,1161336,1129592,2.73%,,0.5,0.5,0.00%
>> 444.namd,1324824,1290968,2.56%,,4.4,4.3,-2.27%
>> 445.gobmk,5003096,4992472,0.21%,,3.9,4.1,5.13%
>> 447.dealII,1003376,975024,2.83%,,162.1,197.8,22.02%
>> 450.soplex,1359416,1326008,2.46%,,0.1,0.1,0.00%
>> 453.povray,1921432,1952280,-1.61%,,32.9,32.7,-0.61%
>> 456.hmmer,1210744,1184632,2.16%,,1.5,1.5,0.00%
>> 458.sjeng,1185976,1155320,2.58%,,812.9,840.2,3.36%
>> 462.libquantum,1101144,1066712,3.13%,,12.3,12.2,-0.81%
>> 464.h264ref,1557176,1593272,-2.32%,,0.3,0.3,0.00%
>> 470.lbm,1091352,1056664,3.18%,,2.3,2.5,8.70%
>> 471.omnetpp,1045496,1046664,-0.11%,,27.9,28.4,1.79%
>> 473.astar,1106680,1071992,3.13%,,2.4,2.6,8.33%
>> 482.sphinx3,1216600,1182680,2.79%,,16.5,16.3,-1.21%
>> 483.xalancbmk,4666936,4669112,-0.05%,,9.7,9.6,-1.03%
>> ,,,,,,,
>> SIZE SUM,34307616,34061104,0.72%,TOTAL SCORES,5.7,5.8,1.75%
>>
>> various benchmarks:
>>
>> binary,base -Oz size,ML -Oz size,ML shrink over base
>> llvm:opt,66318624,62124256,6.32%
>> math_1,6181552,5884144,4.81%
>> event_management_1,4998728,4802696,3.92%
>> llvm:lcalsBRaw,1270168,1222168,3.78%
>> llvm:lcalsBLambda,1270232,1222232,3.78%
>> llvm:lcalsARaw,1276248,1228248,3.76%
>> llvm:lcalsALambda,1276440,1228504,3.76%
>> llvm:lcalsCRaw,1278936,1231000,3.75%
>> llvm:lcalsCLambda,1279064,1231256,3.74%
>> llvm:Blur,1236888,1191192,3.69%
>> image_processing_diffusion,1236120,1190488,3.69%
>> llvm:Interpolation,1237208,1191576,3.69%
>> llvm:harris,1237208,1191768,3.67%
>> llvm:Dither,1238232,1192792,3.67%
>> event_management_2,4741096,4568584,3.64%
>> hashing_1,5189360,5004176,3.57%
>> compression_2,5329392,5140496,3.54%
>> math_2,6072336,5858992,3.51%
>> crypto_1,4721576,4556008,3.51%
>> math_4,27720208,26755568,3.48%
>> math_3,27732496,26767920,3.48%
>> infra_1,30673232,29630464,3.40%
>> hashing_2,5834544,5637168,3.38%
>> image_processing_entropy,5854960,5657008,3.38%
>> hashing_3,5831088,5634288,3.38%
>> memory_management_1,4957960,4790632,3.37%
>> hasing_4,5816048,5619888,3.37%
>> data_storage_2,5852976,5655792,3.37%
>> compression_1,5971984,5771120,3.36%
>> arena_unittest,6672944,6453584,3.29%
>> data_storage_3,44113232,42710960,3.18%
>> data_storage_1,22858992,22132960,3.18%
>> datastructures_1,6362032,6161264,3.16%
>> infra_2,56828720,55053496,3.12%
>> datastructures_2,164558704,160790320,2.29%
>> llvm:clang,77027416,75672088,1.76%
>> benchmark_infra,20221424,19944432,1.37%
>> search_1,169475360,167645632,1.08%
>> protobuf_3,24071504,23895440,0.73%
>> protobuf_1,24071504,23895504,0.73%
>> protobuf_2,24071504,23895504,0.73%
>> protobuf_4,24071504,23895504,0.73%
>> protobuf_5,24071504,23895504,0.73%
>> protobuf_6,23982800,23809424,0.72%
>>
>> On Wed, Apr 8, 2020 at 2:11 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
>>
>>> Cool!
>>>
>>> I skipped to the end and tried to access the gdoc and the spreadsheet, but it did tell me I need permissions. Can you make them accessible, or am I the problem?
>>>
>>> Thanks,
>>> Johannes
>>>
>>> On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote:
>>> > TL;DR: We can improve compiler optimizations driven by heuristics by replacing those heuristics with machine-learned policies (ML models). Policies are trained offline and ship as part of the compiler. Determinism is maintained because models are fixed when the compiler is operating in production. Fine-tuning or regressions may be handled by incorporating the interesting cases in the ML training set, retraining the compiler, and redeploying it.
>>> >
>>> > For a first milestone, we chose inlining for size (-Oz) on X86-64. We were able to train an ML model to produce binaries 1.5-6% smaller than -Oz of tip-of-tree. The trained model appears to generalize well over a diverse set of binaries. Compile time is increased marginally (under 5%). The model also happens to produce slightly better-performing code under SPEC2006 - a total score improvement of 1.75%. As we only wanted to verify there is no significant regression in SPEC, and given the milestone goals, we haven’t dug any deeper into the speed results.
>>> >
>>> > We see these results as promising, and as a reasonable point for contributing our current work as a build-time opt-in to LLVM to benefit the community, in the hope of fostering collaboration and learning from the community’s feedback, as we try to better understand the trade-offs such an approach entails, and as we work on expanding the depth and breadth of applying these techniques to compiler optimization problems.
>>> >
>>> > https://reviews.llvm.org/D77752
>>> >
>>> > Motivation
>>> >
>>> > Optimization problems, such as inlining or register allocation, are hard: we don’t know of algorithms that are guaranteed to produce the optimum solution in all cases in a timely fashion. Instead, we use heuristics: algorithms that we expect to produce some approximation of the optimum, with some expected degree of generality, in a practical amount of time.
>>> >
>>> > Heuristics have some common characteristics. Taking inlining as a case study: the heuristic traverses the problem space in some way (bottom-up traversal of the SCC graph), extracts some properties (let’s call them “features”) of the program being optimized, and combines them with some weights (“tuning”) to produce a cost (InlineCost), which allows for trade-off analysis. We validate the effectiveness of a heuristic over some accepted set of benchmarks. Over time, we react to regressions or pathological cases observed in the field by manually analyzing such cases, figuring out an enhancement to the heuristic, and then re-validating over that set of benchmarks (perhaps augmented by adding the newly found cases).
>>> > Because heuristics are code that needs to be maintained, there is pressure to reduce complexity: adding more features means we need to reason about the interactions between the old and new features, which scales combinatorially. Re-tuning because of the new features adds a similar kind of complexity. Potentially, we miss out on optimization improvements as a result.
>>> >
>>> > Because tuning is manual, there is pressure to keep the number of benchmarks that can be studied in depth to a humanly-manageable size, which potentially affects the generality of a heuristic or heuristic tuning.
>>> >
>>> > The main advantage of manual heuristics is arguably their relatively lower overhead: no additional dependencies, and greater transparency to human analysis and improvement.
>>> >
>>> > Machine learning, in particular reinforcement learning, can address the tensions found in manual heuristics: once features are extracted from the program, the way they are combined and tuned can easily be scaled up through automation, improving effectiveness and generality. A major drawback of machine learning, at least at this point in time, is that we don’t yet have a fully developed systematic approach for improving policy effectiveness.
>>> >
>>> > High level design
>>> >
>>> > We identify two scenarios for a compiler using ML policies: development and release.
>>> >
>>> > The release scenario is equivalent to the regular compilation we have today - the only difference is that it uses a pre-trained model (trained in the development scenario beforehand) to make decisions instead of the heuristics. Determinism is guaranteed since the model in the release scenario is fixed. We imagine teams wishing to fine-tune the effectiveness of the optimization to their scenarios would train a different model.
>>> >
>>> > The decision previously evaluated using a human-crafted heuristic is optionally replaced by:
>>> >
>>> > - a compiler-specific component, extracting features from IR (i.e. a vector of values)
>>> > - an evaluation of an ML model using those features, to obtain a result. In ML nomenclature, this is referred to as using the model for inference (as opposed to training it).
>>> >
>>> > For example, when we replace the decision of whether to inline a call site, the ML model produces a boolean (inline/don’t inline) based on a feature vector characterizing the call site and some broader module-wide context.
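To make the release-mode decision swap above concrete: the heuristic cost computation and the model evaluation consume the same feature vector and yield the same kind of boolean answer. The following is a minimal Python sketch only; the feature names, the serving signature, and the output key are assumptions for illustration, not the interface of the actual (C++) implementation in the patch.

    import tensorflow as tf

    # Hypothetical feature vector for one call site; names are illustrative.
    features = {
        "callee_basic_block_count": 12.0,
        "callsite_height": 3.0,
        "nodes_in_caller": 240.0,
    }

    def heuristic_should_inline(f):
        # Stand-in for a hand-written cost computation (think InlineCost):
        # hand-picked weights, hand-picked threshold.
        cost = 2.0 * f["callee_basic_block_count"] + f["nodes_in_caller"] / 100.0
        return cost < 30.0

    def ml_should_inline(f, saved_model_dir):
        # Same inputs, same boolean outcome, but the combination of features
        # is a pre-trained TensorFlow SavedModel; no training at compile time.
        model = tf.saved_model.load(saved_model_dir)
        infer = model.signatures["serving_default"]
        inputs = {k: tf.constant([[v]]) for k, v in f.items()}
        output = infer(**inputs)  # assumes the input names match the signature
        decision = list(output.values())[0]
        return bool(decision.numpy().reshape(-1)[0] > 0)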
>>> > Training/development is more complicated, and happens offline - akin to how, today, attempts to improve an optimizing pass also happen offline. A description of the high level design, and the specifics we used for the current scope, are given in the Appendix.
>>> >
>>> > Current Scope
>>> >
>>> > The goal of our first milestone was to evaluate, end to end, an integration of ML with LLVM, and get a first promising result. To that end, we chose inlining for size (-Oz) as a stepping stone, as we perceived it to be more likely to require a simpler evaluation setup than performance-oriented optimizations might. At this point, we only train whether a call site may be inlined or not, leaving the SCC traversal order as-is.
>>> >
>>> > We are proposing an initial change demonstrating the inference mechanism using a pre-trained model, as a build-time opt-in to llvm. The compiler components needed to perform training are also included in this first change. Subsequent changes would include more training-related components.
>>> >
>>> > At a high level, the changes we are proposing consist of:
>>> >
>>> > 1. a new module analysis pass, InliningAdvisor. By default, its implementation does nothing.
>>> > 2. minimal hooks into Inliner.cpp.
>>> > 3. the implementation of InliningAdvisor, activated when we opt in to ML. This is available in Analysis/ML, together with:
>>> >    1. Rel/Dev specific ML model handling, also under Analysis/ML
>>> >    2. a pre-trained model for inlining for size (Analysis/ML/models/inlining)
>>> >    3. a pre-trained model for predicting native size from IR (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
>>> > 4. some refactorings in PassBuilder, to allow opting into running mandatory inlining first - some compilation speedup for the ML case, with a minimal, noise-like size effect. This also simplifies testing (these would be introduced as a preliminary patch).
>>> >
>>> > We discuss ‘Rel’ mode here, and ‘Dev’ mode in the Appendix, as it is more involved.
>>> >
>>> > Inference Opt-In Mechanism
>>> >
>>> > The feature is primarily controlled by the cmake flag LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The “Rel”ease case requires specifying the location of the pip tensorflow package (currently, that’s tf_nightly; it should soon be available in tensorflow).
>>> >
>>> > To opt in to the ‘Rel’ case:
>>> >
>>> > 1. install the tensorflow pip package:
>>> >
>>> >    pip3 install tf_nightly --user
>>> >
>>> > 2. configure the llvm build:
>>> >
>>> >    cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
>>> >      -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
>>> >      {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - it’s usually in the path>}
>>> >
>>> > 3. build llvm as usual.
>>> >
>>> > 4. pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to clang.
>>> >
>>> > Details
>>> >
>>> > The ML model is captured as a TensorFlow ‘saved model’. When building llvm, we use TensorFlow’s XLA native compiler (saved_model_cli) to compile the saved model into a native static library and a header file. Insofar as LLVM is concerned, there are minimal additional runtime requirements, packaged as source within the pip package: C++ wrappers around the compiled model. These will also be statically linked in the LLVM target. The compiled code is otherwise just a complex arithmetical computation, with no special requirements - it is single-threaded and runs natively on the targeted architecture. Together with the aforementioned runtime dependencies, it adds ~115KB to the clang binary (a 0.08% increase).
>>> > Runtime-wise, we observed a ~10% increase in the time spent in the inliner, for a large (33MB) binary IR module; inlining typically consumes ~10-15% of total compilation time, so the overall compile time overhead of the approach is arguably negligible (roughly 1-1.5% end to end). This cost is almost entirely attributable to feature extraction.
>>> >
>>> > Memory-wise, the precompiled model has a fixed-size buffer for its inputs, and performs a fixed amount of computation, so the memory overhead inherent to our approach is independent of the program being optimized. Using a small example, to avoid effects such as memory use differences due to different inlinings, we observed a 300KB increase in the maximum resident size.
>>> >
>>> > A table showing the effect on the size of -Oz compiled binaries is given in the Appendix.
>>> >
>>> > Next directions
>>> >
>>> > Our next milestone has two main high-level goals: detailing a systematic approach to driving policy effectiveness, and exploring in depth the types of features and training algorithms most appropriate for compiler problems, or at least for problems like inlining. For the latter, we expect embedding more of the call graph structure to play an important role, as well as, potentially, delegating the problem space traversal to the ML model.
>>> >
>>> > We plan to include inlining for speed as part of our work on these goals.
>>> >
>>> > Appendix
>>> >
>>> > Training - High Level
>>> >
>>> > We use Reinforcement Learning (RL) to train the inline-for-size model. At a high level, it is composed of 3 parts: training data collection, model training, and iterative data collection/model training. We use TensorFlow as our ML framework.
>>> >
>>> > Relatedly, we also needed to learn a separate model to evaluate the native size of a function, given its IR, in order to calculate a more precise reward for the reinforcement learning algorithm (“IR2Native”). We evaluated ‘just counting IR’ and TargetTransformInfo, but they appeared to provide too noisy a signal for the reward, insofar as the RL training algorithm for the inlining model was concerned. This model is only used during training.
>>> >
>>> > RL - Training data collection: the training data we need to feed into a reinforcement learning algorithm are sequences consisting of: the state of the problem (i.e. the features), the action (inline/not inline), and the reward (native size shrinkage after inline/not inline, using ir2native). To collect the sequences, we hook logging infrastructure into the LLVM inliner, so that it can produce logs after the inline optimization pass.
>>> >
>>> > RL - Model training: We use DQN (Deep Q-Network) to train our inlining-for-size ML policy. At a high level, the DQN algorithm trains a neural network to predict the value of different actions --- the DQN policy then chooses the action with the highest predicted value. In our scenario, we have two actions: 1) inline; 2) don’t inline. The neural network predicts the size reduction of each action based on the features, and decides to inline if it predicts that inlining leads to the larger size reduction. After training finishes, it produces a TensorFlow SavedModel that takes features as input and generates inline decisions (whether to inline or not) as output.
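A minimal sketch of a DQN-style setup along the lines described above, using tf.keras. The feature count, network shape, and the simplified one-step update are assumptions for illustration; the actual training code is not part of the proposed patch.

    import numpy as np
    import tensorflow as tf

    NUM_FEATURES = 11  # assumed feature vector length

    # Q-network: maps a call-site feature vector to the predicted value
    # (expected size reduction) of each action: [don't inline, inline].
    q_net = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(NUM_FEATURES,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    q_net.compile(optimizer="adam", loss="mse")

    def choose_action(features, epsilon=0.1):
        # Epsilon-greedy exploration during data collection; the exported
        # policy simply takes the argmax at compile time.
        if np.random.rand() < epsilon:
            return np.random.randint(2)
        q_values = q_net.predict(features[None, :], verbose=0)[0]
        return int(np.argmax(q_values))

    def train_step(features, actions, rewards):
        # Simplified fitted-Q update over logged (state, action, reward)
        # tuples; a full DQN would also bootstrap from the next state's value.
        targets = q_net.predict(features, verbose=0)
        targets[np.arange(len(actions)), actions] = rewards
        q_net.fit(features, targets, epochs=1, verbose=0)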
>>> > The choice of features and reward is essential in model training. The features are chosen with the consideration of being helpful in making the decision --- the input to the existing cost model is a good starting point. Ideally, the reward in the inline-for-size problem is the native size shrinkage after inline/not inline. It is difficult to obtain precisely, so we use an estimate instead. This means that, for training, we also need a model (not necessarily machine learned) for estimating rewards.
>>> >
>>> > RL - Iterative data collection/model training: Reinforcement learning is ideally an iterative model/policy improvement process in which: 1) the trained model is deployed to the field to collect new data; 2) the newly collected data are used to update the model. Thus, we need to do iterative data collection/model training. To facilitate data collection (avoiding complex build dependencies and time spent before/after the pass being trained), we isolate IR modules captured right before the optimization we are interested in, and iterate on them with opt. We bootstrap the training from the current heuristic, and stop the process once we are happy with the outcome.
>>> >
>>> > IR2Native: We collect IR features (different from the ones used for inlining) for each function at the end of inlining, right before we perform function simplification passes, and again right after. This means we have two IR ‘shapes’ of the same function, and we know no further inlinings will be performed, so whatever changes happen are based on that IR. We then extract the native size at the end of compilation. Together, this data forms two records per function that can be used in supervised learning - the features are those extracted from the IR, and the label is the native size. Training IR2Native happens independently from the training of the inliner model.
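Sketching the IR2Native step above as plain supervised learning: each function ‘shape’ contributes one (IR features, measured native size) record, and the model is a simple regressor trained separately from the inlining policy. The feature names, record values, and model shape below are placeholders, not the actual features or data.

    import numpy as np
    import tensorflow as tf

    # One record per function 'shape' (before/after simplification passes),
    # labeled with the native size measured at the end of compilation.
    records = [
        {"ir_instructions": 420, "basic_blocks": 35, "calls": 12, "native_size": 1184},
        {"ir_instructions": 96,  "basic_blocks": 7,  "calls": 2,  "native_size": 312},
    ]

    feature_keys = ["ir_instructions", "basic_blocks", "calls"]
    x = np.array([[r[k] for k in feature_keys] for r in records], dtype=np.float32)
    y = np.array([r["native_size"] for r in records], dtype=np.float32)

    # A plain regression model: predicts native size from IR features; during
    # RL training of the inliner it is used only to compute rewards.
    ir2native = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(len(feature_keys),)),
        tf.keras.layers.Dense(1),
    ])
    ir2native.compile(optimizer="adam", loss="mae")
    ir2native.fit(x, y, epochs=10, verbose=0)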
>>> > Training support for the current scope
>>> >
>>> > The initial change includes the logging mechanism, an ir2native model trained for x86-64, and the means to rapidly iterate over development ML models. For the components that will be included in subsequent changes, the rest of this section describes the mechanisms we employed. These components are detailed further below.
>>> >
>>> > To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow C API library <https://www.tensorflow.org/install/lang_c>. We then configure the build:
>>> >
>>> > cmake .. -DLLVM_USE_ML_POLICY=Dev \
>>> >   -DLLVM_TF_C_LIB=<path to unarchived package> \
>>> >   {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy the tensorflow shared library over, if it’s not on LD_LIBRARY_PATH}
>>> >
>>> > To extract IR right before inlining, we hacked on top of the ThinLTO infrastructure, interrupting its pre-link pipeline right before inlining. This means clang produces binary IR files captured at that stage. We then built a large target, obtaining a corpus of ~25K modules. We intend to provide a clean mechanism in a subsequent change.
>>> >
>>> > To obtain features/labels for training this “IR to Native Size” model, we had to make some changes to the AsmPrinter (to get real native sizes) and llvm-readobj, as well as add an analysis pass for extracting the IR features for this model. We plan on upstreaming these changes subsequently.
>>> >
>>> > Finally, the infrastructure driving the policy training is currently built on proprietary APIs, as it benefits from a distributed computing infrastructure. We are currently evaluating options for open sourcing it. In the meantime, we are presenting the high level implementation details.
>>> >
>>> > To train a new model, the infrastructure performs two steps: extracting the logs, and using them in a training algorithm.
>>> >
>>> > Log extraction is highly parallelizable: for each IR module in the training corpus, we need to run opt once (or a few times, when we explore improvements). Specifically, each run is this invocation:
>>> >
>>> > opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to ir2native> -training-log=<path to training log output> -enable-ml-inliner -mandatory-inlinings-first -o <output> <module.o>
>>> >
>>> > Then we collect the logs and pass them to the next step.
>>> >
>>> > Experimental results
>>> >
>>> > Experimental results are available as follows:
>>> >
>>> > - SPEC2006 <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true> binary sizes (-Oz) and ‘test run’ scores.
>>> > - Size report <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true> from some internal benchmarks and binaries, including opt and clang.
>>> >
>>> > _______________________________________________
>>> > LLVM Developers mailing list
>>> > llvm-dev at lists.llvm.org
>>> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
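The log-extraction step described in the RFC above is one opt invocation per module of the ~25K-module corpus, so it parallelizes trivially. Below is a minimal driver sketch using only the Python standard library; the corpus directory layout, file extension, and model path are assumptions, while the opt flags mirror the invocation quoted above.

    import multiprocessing
    import pathlib
    import subprocess

    CORPUS_DIR = pathlib.Path("corpus")            # assumed: one IR module per file
    LOG_DIR = pathlib.Path("training_logs")
    IR2NATIVE_MODEL = "models/ir_2_native_x86_64"  # assumed path to the ir2native model

    def extract_log(module_path):
        # One run of opt per module, mirroring the command in the RFC.
        log_path = LOG_DIR / (module_path.stem + ".log")
        out_path = LOG_DIR / (module_path.stem + ".out.o")
        cmd = [
            "opt", "-passes=scc-oz-module-inliner",
            f"-ml-inliner-ir2native-model={IR2NATIVE_MODEL}",
            f"-training-log={log_path}",
            "-enable-ml-inliner", "-mandatory-inlinings-first",
            "-o", str(out_path), str(module_path),
        ]
        subprocess.run(cmd, check=True)
        return log_path

    if __name__ == "__main__":
        LOG_DIR.mkdir(exist_ok=True)
        modules = sorted(CORPUS_DIR.glob("*.o"))
        with multiprocessing.Pool() as pool:
            logs = pool.map(extract_log, modules)
        print(f"collected {len(logs)} training logs")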
Tsung-Yu Hsieh via llvm-dev
2020-Apr-09 02:18 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
Hey Developers: That is cool. In fact, I also implemented a policy gradient approach for the register allocation task in LLVM about two years ago, and got some amazing feedback. This is my project: https://github.com/Knight-X/rLens . I am interested in doing optimization work with machine learning techniques. I am looking forward to your progress, and hopefully to doing some interesting work with you.

BR,
Kazami

Mircea Trofin via llvm-dev <llvm-dev at lists.llvm.org> wrote on Thu, Apr 9, 2020 at 10:06 AM:

> +Yundi Qian <yundi at google.com> +Eugene Brevdo <ebrevdo at google.com>, our team members from the ML side.
>
> To avoid formatting issues, here is a link to the RFC <https://docs.google.com/document/d/1BoSGQlmgAh-yUZMn4sCDoWuY6KWed2tV58P4_472mDE/edit?usp=sharing>, open to comments.
>
> Thanks!
>
> [...]
Xinliang David Li via llvm-dev
2020-Apr-09 16:56 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
One suggestion: should we consolidate the discussion into the main thread? I know some folks are not willing to comment in Google docs.

David

On Wed, Apr 8, 2020 at 7:06 PM Mircea Trofin via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> +Yundi Qian <yundi at google.com> +Eugene Brevdo <ebrevdo at google.com>, our team members from the ML side.
>
> To avoid formatting issues, here is a link to the RFC <https://docs.google.com/document/d/1BoSGQlmgAh-yUZMn4sCDoWuY6KWed2tV58P4_472mDE/edit?usp=sharing>, open to comments.
>
> Thanks!
>
> [...]
image_processing_entropy,5854960,5657008,3.38% >>> hashing_3,5831088,5634288,3.38% >>> memory_management_1,4957960,4790632,3.37% >>> hasing_4,5816048,5619888,3.37% >>> data_storage_2,5852976,5655792,3.37% >>> compression_1,5971984,5771120,3.36% >>> arena_unittest,6672944,6453584,3.29% >>> data_storage_3,44113232,42710960,3.18% >>> data_storage_1,22858992,22132960,3.18% >>> datastructures_1,6362032,6161264,3.16% >>> infra_2,56828720,55053496,3.12% >>> datastructures_2,164558704,160790320,2.29% >>> llvm:clang,77027416,75672088,1.76% >>> benchmark_infra,20221424,19944432,1.37% >>> search_1,169475360,167645632,1.08% >>> protobuf_3,24071504,23895440,0.73% >>> protobuf_1,24071504,23895504,0.73% >>> protobuf_2,24071504,23895504,0.73% >>> protobuf_4,24071504,23895504,0.73% >>> protobuf_5,24071504,23895504,0.73% >>> protobuf_6,23982800,23809424,0.72% >>> >>> On Wed, Apr 8, 2020 at 2:11 PM Johannes Doerfert < >>> johannesdoerfert at gmail.com> wrote: >>> >>>> >>>> Cool! >>>> >>>> I skipped to the end and tried to access the gdoc and the spreadsheet >>>> but it did tell me I need permissions. Can you make them accessible or >>>> am I the problem? >>>> >>>> Thanks, >>>> Johannes >>>> >>>> On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote: >>>> > TL;DR; We can improve compiler optimizations driven by heuristics by >>>> > replacing those heuristics with machine-learned policies (ML models). >>>> > Policies are trained offline and ship as part of the compiler. >>>> Determinism >>>> > is maintained because models are fixed when the compiler is >>>> operating in >>>> > production. Fine-tuning or regressions may be handled by >>>> incorporating the >>>> > interesting cases in the ML training set, retraining the compiler, >>>> and >>>> > redeploying it. >>>> > >>>> > For a first milestone, we chose inlining for size (-Oz) on X86-64. >>>> We >>>> were >>>> > able to train an ML model to produce binaries 1.5-6% smaller than >>>> -Oz of >>>> > tip-of-tree. The trained model appears to generalize well over a >>>> diverse >>>> > set of binaries. Compile time is increased marginally (under 5%). >>>> The >>>> model >>>> > also happens to produce slightly better - performing code under >>>> SPEC2006 - >>>> > total score improvement by 1.75%. As we only wanted to verify there >>>> is no >>>> > significant regression in SPEC, and given the milestone goals, we >>>> haven’t >>>> > dug any deeper into the speed results. >>>> > >>>> > We see these results as promising, and as a reasonable point for >>>> > contributing our current work as a build-time opt-in to LLVM to >>>> benefit the >>>> > community, in the hope of fostering collaboration and learning from >>>> the >>>> > community’s feedback, as we try to better understand the trade-offs >>>> such an >>>> > approach entails, and as we work on expanding the depth and breadth >>>> of >>>> > applying these techniques to compiler optimization problems. >>>> > >>>> > https://reviews.llvm.org/D77752 >>>> > Motivation >>>> > >>>> > Optimization problems, such as inlining or register allocation, are >>>> hard: >>>> > we don’t know of algorithms that are guaranteed to produce the >>>> optimum >>>> > solution in all cases in a timely fashion. Instead, we use >>>> heuristics: >>>> > algorithms that we expect to produce some approximation of the >>>> optimum, >>>> > with some expected degree of generality, in a practical amount of >>>> time. >>>> > >>>> > Heuristics have some common characteristics. 
Taking inlining as a >>>> case >>>> > study, it traverses the problem space in some way (bottom-up >>>> traversal of >>>> > the SCC graph), extracts some properties (let’s call them >>>> “features”) of >>>> > the program being optimized, and combine them with some weights >>>> (“tune”), >>>> > to produce a cost (InlineCost), which allows for trade-off analysis. >>>> We >>>> > validate the effectiveness of a heuristic over some accepted set of >>>> > benchmarks. Over time, we react to regressions or pathological cases >>>> > observed in the field, by manually analyzing such cases, figuring >>>> out an >>>> > enhancement to the heuristic, and then re-validating over that set of >>>> > benchmarks (maybe augmented by adding the newly found cases). >>>> > >>>> > Because heuristics are code that needs to be maintained, there is >>>> pressure >>>> > to reduce complexity: adding more features means we need to reason >>>> about >>>> > the interactions between the old and new features, which scales >>>> > combinatorially. Re-tuning because of the new features adds a >>>> similar >>>> kind >>>> > of complexity. Potentially, we miss out on optimization improvements >>>> as a >>>> > result. >>>> > >>>> > Because tuning is manual, there is pressure to keep the number of >>>> > benchmarks that can be studied in depth to a humanly-manageable >>>> size, >>>> which >>>> > potentially affects the generality of a heuristic or heuristic >>>> tuning. >>>> > >>>> > The main advantage of manual heuristics is arguably their relatively >>>> lower >>>> > overhead: no additional dependencies and more transparent to human >>>> analysis >>>> > and improvement. >>>> > >>>> > Machine learning, in particular reinforcement learning, can address >>>> the >>>> > tensions found in manual heuristics: once features are extracted >>>> from the >>>> > program, the way they are combined and tuned can easily be scaled up >>>> > through automation, improving effectiveness and generality. A major >>>> > drawback, at least at this point in time, of machine learning, is >>>> that we >>>> > don’t yet have a fully developed systematic approach for improving >>>> policy >>>> > effectiveness. >>>> > High level design >>>> > >>>> > We identify two scenarios for a compiler using ML policies: >>>> development and >>>> > release. >>>> > >>>> > The release scenario is equivalent to the regular compilation we >>>> have >>>> today >>>> > - the only difference is that it uses a pre-trained model (trained >>>> in the >>>> > development scenario beforehand) to make decisions instead of the >>>> > heuristics. Determinism is guaranteed since the model in the release >>>> > scenario is fixed. We imagine teams wishing to fine tune the >>>> effectiveness >>>> > of the optimization to their scenarios would train a different model. >>>> > >>>> > The decision previously evaluated using a human-crafted heuristic is >>>> > optionally replaced by: >>>> > >>>> > - >>>> > >>>> > a compiler-specific component, extracting features from IR (i.e. a >>>> > vector of values) >>>> > - >>>> > >>>> > an evaluation of an ML model using those features, to obtain a >>>> result. >>>> > In ML nomenclature, this is referred to using the model for >>>> inference (as >>>> > opposed to training it) >>>> > >>>> > For example, when we replaced the decision of whether to inline a >>>> callsite, >>>> > the ML model produces a boolean (inline/don’t inline) based on a >>>> features >>>> > vector characterizing the call site and some broader module-wide >>>> context. 
>>>> > Training/development is more complicated, and happens offline - akin to how, today, attempts to improve an optimizing pass also happen offline. A description of the high level design and the specifics we used for the current scope are given in the Appendix.
>>>> > Current Scope
>>>> >
>>>> > The goal of our first milestone was to evaluate, end to end, an integration of ML with LLVM, and get a first promising result. To that end, we chose inlining for size (-Oz) as a stepping stone, as we perceived it to be more likely to require a simpler evaluation setup than performance-oriented optimizations might. At this point, we only train whether a call site may be inlined or not, leaving the SCC traversal order as-is.
>>>> >
>>>> > We are proposing an initial change demonstrating the inference mechanism using a pre-trained model, as a build-time opt-in to LLVM. The compiler components needed to perform training are also included in this first change. Subsequent changes would include more training-related components.
>>>> >
>>>> > At a high level, the changes we are proposing consist of:
>>>> >
>>>> > 1. a new module analysis pass, InliningAdvisor. By default, its implementation does nothing.
>>>> > 2. minimal hooks into Inliner.cpp.
>>>> > 3. the implementation of InliningAdvisor, activated when we opt in to ML. This is available in Analysis/ML, together with:
>>>> >    1. Rel/Dev-specific ML model handling, also under Analysis/ML
>>>> >    2. a pre-trained model for inlining for size (Analysis/ML/models/inlining)
>>>> >    3. a pre-trained model for predicting native size from IR (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
>>>> > 4. some refactorings in PassBuilder, to allow opting into running mandatory inlining first - some compilation speedup for the ML case, with a minimal, noise-like size effect. This also simplifies testing (these would be introduced as a preliminary patch).
>>>> >
>>>> > We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is more involved.
>>>> > Inference Opt-In Mechanism
>>>> >
>>>> > The feature is primarily controlled by the cmake flag LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The “Rel”ease case requires specifying the location of the pip tensorflow package (currently, that’s tf_nightly; it should soon be available in tensorflow).
>>>> >
>>>> > To opt in to the ‘Rel’ case:
>>>> >
>>>> > 1. install the tensorflow pip package:
>>>> >
>>>> > pip3 install tf_nightly --user
>>>> >
>>>> > 2. configure the llvm build:
>>>> >
>>>> > cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
>>>> >   -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
>>>> >   {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - it’s usually in the path>}
>>>> >
>>>> > 3. build llvm as usual.
>>>> > 4. pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to clang.
>>>> >
>>>> > Details
>>>> >
>>>> > The ML model is captured as a TensorFlow ‘saved model’.
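>>>> >
>>>> > (As an aside, a purely illustrative C++ sketch - with hypothetical names, not the generated API - of the kind of interface the compiler ends up programming against once such a model is compiled to native code, as described next: fill a fixed-size input buffer, run a fixed computation, read the result.)
>>>> >
>>>> > #include <array>
>>>> > #include <cstdint>
>>>> > #include <numeric>
>>>> >
>>>> > // Stand-in for the class produced by AOT-compiling the saved model.
>>>> > class CompiledPolicySketch {
>>>> >   std::array<int64_t, 4> Inputs{}; // fixed-size input buffer
>>>> >   int64_t Output = 0;
>>>> > public:
>>>> >   void setInput(size_t Index, int64_t Value) { Inputs[Index] = Value; }
>>>> >   void run() {
>>>> >     // The real generated code is a fixed arithmetic computation over
>>>> >     // the inputs; this placeholder just sums them.
>>>> >     Output = std::accumulate(Inputs.begin(), Inputs.end(), int64_t(0));
>>>> >   }
>>>> >   int64_t output() const { return Output; }
>>>> > };
>>>> >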
>>>> > When building llvm, we use TensorFlow’s XLA native compiler (saved_model_cli) to compile the saved model into a native static library and a header file. Insofar as LLVM is concerned, there are minimal additional runtime requirements, packaged as source within the pip package: C++ wrappers around the compiled model. These will also be statically linked in the LLVM target. The compiled code is otherwise just a complex arithmetical computation, with no special requirements - it is single threaded and runs natively on the targeted architecture. Together with the aforementioned runtime dependencies, it adds ~115KB to the clang binary (a 0.08% increase).
>>>> >
>>>> > Runtime-wise, we observed a ~10% increase in the time spent in the inliner, for a large (33MB) binary IR module; inlining typically consumes ~10-15% of total compilation time, so the overall compile time overhead of the approach is arguably negligible. This cost is almost entirely attributable to feature extraction.
>>>> >
>>>> > Memory-wise, the precompiled model has a fixed-size buffer for its inputs, and performs a fixed amount of computation, so the memory overhead inherent to our approach is independent of the program being optimized. Using a small example to avoid effects such as memory use differences due to different inlinings, we observed a 300KB increase in the maximum resident size.
>>>> >
>>>> > A table showing the effect on the size of -Oz-compiled binaries is given in the Appendix.
>>>> > Next directions
>>>> >
>>>> > Our next milestone has two main high level goals: detailing a systematic approach to driving policy effectiveness; and exploring in depth the types of features and training algorithms most appropriate for compiler problems, or at least problems like inlining. For the latter, we expect embedding more of the call graph structure to play an important role, as well as, potentially, delegating the problem space traversal to the ML model.
>>>> >
>>>> > We plan to include inlining for speed as part of our work on these goals.
>>>> > Appendix
>>>> > Training - High Level
>>>> >
>>>> > We use Reinforcement Learning (RL) to train the inline-for-size model. At a high level, it is composed of three parts: training data collection, model training, and iterative data collection/model training. We use TensorFlow as our ML framework.
>>>> >
>>>> > Relatedly, we also needed to learn a separate model to evaluate the native size of a function, given its IR, in order to calculate a more precise reward for the reinforcement learning algorithm (“IR2Native”). We evaluated ‘just counting IR’ and TargetTransformInfo, but they appeared to provide too noisy a signal for the reward, insofar as the RL training algorithm for the inlining model was concerned. This model is only used during training.
>>>> >
>>>> > RL - Training data collection: the training data we need to feed into a reinforcement learning algorithm are sequences consisting of: the state of the problem (i.e. features), the action (inline/don’t inline), and the reward (native size shrinkage after the inline/don’t-inline decision, estimated using ir2native).
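>>>> >
>>>> > For illustration only - a hypothetical record layout, not the actual log format - each logged decision could be represented roughly like this:
>>>> >
>>>> > #include <cstdint>
>>>> > #include <vector>
>>>> >
>>>> > // One (state, action, reward) sample for the RL algorithm.
>>>> > struct InlineTrainingSample {
>>>> >   std::vector<int64_t> Features; // state: same features as at inference time
>>>> >   bool Inlined = false;          // action taken at this call site
>>>> >   int64_t RewardBytes = 0;       // estimated native size shrinkage (ir2native)
>>>> > };
>>>> >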
>>>> > To collect the sequences, we hook logging infrastructure into the LLVM inliner, so that it can produce logs after the inline optimization pass.
>>>> >
>>>> > RL - Model training: We use DQN (Deep Q-Network) to train our inlining-for-size ML policy. At a high level, the DQN algorithm trains a neural network to predict the value of different actions --- the DQN policy then chooses to take the action with the highest predicted value. In our scenario, we have two actions: 1) inline; 2) don’t inline. The neural network predicts the size reduction of these two actions based on the features, and then decides to inline if it predicts that inlining leads to the higher size reduction. After the training finishes, it produces a TensorFlow SavedModel that takes features as input and generates inline decisions (whether to inline or not) as output.
>>>> >
>>>> > The choice of features and reward is essential in model training. The features are chosen for being helpful in making the decision --- the input to the cost model is a good starting point. Ideally, the reward in the inline-for-size problem is the native size shrinkage after the inline/don’t-inline decision. It is difficult to obtain precisely, so we use an estimate instead. This means that, for training, we also need a model (not necessarily machine learned) for estimating rewards.
>>>> >
>>>> > RL - Iterative data collection/model training: Reinforcement learning is ideally an iterative model/policy improvement process in which: 1) the trained model is deployed to the field to collect new data; and 2) the newly collected data are used to update the model. Thus, we need to do iterative data collection/model training. To facilitate data collection (avoiding complex build dependencies and time spent before/after the pass being trained), we isolate the IR modules captured right before the optimization we are interested in, and iterate on them with opt. We bootstrap the training from the current heuristic, and stop the process once we are happy with the outcome.
>>>> >
>>>> > IR2Native: We collect IR features (different from the ones used for inlining) for each function at the end of inlining, right before we perform function simplification passes, and again right after. This means we have two IR ‘shapes’ of the same function, and we know no further inlinings will be performed, so whatever changes happen are based on that IR. We then extract the native size at the end of compilation. Together, this data forms two records per function that can be used in supervised learning - the features are those extracted from IR, and the label is the native size. Training IR2Native happens independently of the training of the inliner model.
>>>> > Training support for the current scope
>>>> >
>>>> > The initial change includes the logging mechanism, an ir2native model trained for x86-64, and the means to rapidly iterate over development ML models. For the components that will be included in subsequent changes, the rest of this section describes the mechanisms we employed.
>>>> > These components are detailed further below.
>>>> >
>>>> > To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow C API library <https://www.tensorflow.org/install/lang_c>. We then configure the build:
>>>> >
>>>> > cmake .. -DLLVM_USE_ML_POLICY=Dev \
>>>> >   -DLLVM_TF_C_LIB=<path to unarchived package> \
>>>> >   {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy the tensorflow shared library over, if it’s not on LD_LIBRARY_PATH}
>>>> >
>>>> > To extract IR right before inlining, we hacked on top of the ThinLTO infrastructure, interrupting its pre-link pipeline right before inlining. This means clang produces binary IR files captured at that stage. We then built a large target, obtaining a corpus of ~25K modules. We intend to provide a clean mechanism in a subsequent change.
>>>> >
>>>> > To obtain features/labels for training this “IR to Native Size” model, we had to make some changes to the AsmPrinter (to get real native sizes) and llvm-readobj, as well as add an analysis pass for extracting the IR features for this model. We plan on upstreaming these changes subsequently.
>>>> >
>>>> > Finally, the infrastructure driving the policy training is currently built on proprietary APIs, as it benefits from a distributed computing infrastructure. We are currently evaluating options for open sourcing it. In the meantime, we are presenting the high level implementation details.
>>>> >
>>>> > To train a new model, the infrastructure performs two steps: extracting the logs, and using them in a training algorithm.
>>>> >
>>>> > Log extraction is highly parallelizable: for each IR module in the training corpus, we need to run opt once (or a few times, when we explore improvements). Specifically, each run is this invocation:
>>>> >
>>>> > opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to ir2native> -training-log=<path to training log output> -enable-ml-inliner -mandatory-inlinings-first -o <output> <module.o>
>>>> >
>>>> > We then collect the logs and pass them to the next step.
>>>> >
>>>> > Experimental results
>>>> >
>>>> > Experimental results are available as follows:
>>>> >
>>>> > - SPEC2006 <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true> binary sizes (-Oz) and ‘test run’ scores.
>>>> > - Size report <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true> from some internal benchmarks and binaries, including opt and clang.
>>>> >
>>>> > _______________________________________________
>>>> > LLVM Developers mailing list
>>>> > llvm-dev at lists.llvm.org
>>>> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Mircea Trofin via llvm-dev
2020-Apr-09 17:01 UTC
[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
Sorry, I wasn't aware of that. I can make the google doc view-only, keeping the current comments. I'll wait a bit (few hrs) to see if there's any pushback to that. On Thu, Apr 9, 2020 at 9:57 AM Xinliang David Li <xinliangli at gmail.com> wrote:> One suggestion : should we consolidate the discussion into the main > thread? I know some folks are not willing to comment in Google docs. > > David