On Wed, Jun 3, 2015 at 4:19 AM, Dave Bozier <seifsta at gmail.com> wrote:
> Hi Teresa,
>
> Thanks for providing this updated RFC.
>
>> For Sony's linker, are you using the gold plugin or libLTO interfaces?
>> If the latter, I suppose some ThinLTO handling would have to be added
>> to your linker (e.g. to invoke the LLVM hooks to write the stage-2
>> combined function map and either launch the backend processes in
>> parallel or write out a make or other build file). The current support
>> for reading native object wrapped bitcode is baked into IRObjectFile
>> so presumably the Sony linker can handle these native object wrapped
>> bitcode files if it uses libLTO. We would similarly embed the handling
>> of the function index/summary behind an API that can handle either so
>> it is similarly transparent to the linkers. Let me know if there would
>> be additional issues that make wrapped bitcode more difficult in your
>> case, or how we could make ThinLTO usage simpler for you in general.

Hi Dave,

Thanks for the info.

>
> We use the libLTO interfaces, more specifically we use the C API
> located in llvm-c\lto.h.
>
> Our linker won't support native object wrapped bitcode files as our
> LTO currently stands. Right now, it will be recognized as an
> object file and won't get anywhere near the libLTO libraries. We'd
> need to teach our linker to recognize and differentiate native object
> wrapped bitcode files and regular native object files. This isn't
> straightforward as we cannot distinguish them just by looking at the
> file header alone, we would need to parse the sections and look for a
> .llvmbc section. We then need to add special handling of these native
> object wrappers.

Ok, I see. Does it help that there are LTOModule (lto_module_* in the C
API) interfaces for checking if a file contains bitcode (regardless of
whether it is straight-up or native-wrapped)? I don't know how hard it
is in your linker to query these when deciding whether to treat the
object file as bitcode or not, or how hard it is to pass the resulting
object file along to the libLTO routines for handling (they
automatically handle the native-wrapped object files so the linker
shouldn't have to do anything special to read them).

Specifically, in the C API these are the lto_module_is_object_file*
variants, which will return true for either straight-up or
native-wrapped bitcode. All of the mechanics of handling bitcode vs
native object-wrapped bitcode are down in the IRObjectFile handling. So
the LTOModule::isBitcode*/lto_module_is_object_file* routines will
correctly identify native object-wrapped bitcode as bitcode. And the
LTOModule::createFrom*/lto_module_create* routines correctly parse the
native object-wrapped bitcode and return an LTOModule.

As a result, the llvm-lto tool that also uses libLTO interfaces didn't
require any changes when the native-wrapped reading support went in
(r218078), and is able to handle native-wrapped bitcode out of the box.

>
> Handling the function index/summary behind an API sounds like a good idea.

I am going to work on fleshing out this part next so that the actual
format of the files is hidden from clients.

Thanks,
Teresa

>
> On Fri, May 29, 2015 at 4:01 PM, Teresa Johnson <tejohnson at google.com> wrote:
>> On Fri, May 29, 2015 at 6:56 AM, Alex Rosenberg <alexr at leftfield.org> wrote:
>>> My earlier statement about wrapping things in a native object file
>>> held in that it is controversial. It appears to be still central to
>>> your design.
>>>
>>> It may help to look at the problem from a different viewpoint: LLVM
>>> is not a compiler. It is a framework that can be used to make
>>> compiler-like tools.
>>>
>>> From that view, it no longer makes sense to discuss "the plugin," or
>>> gold, or $AR, because there isn't just one of any of those things.
>>> ld64 isn't the only outlier linker to consider. We have our own
>>> linker at Sony, for example.
>>> From this perspective, it makes more sense to consider replacing
>>> the binary utilities with ones that support bitcode, because from a
>>> user perspective, all of the linkers already transparently support
>>> bitcode directly today, as do ar, nm, etc. This has been necessary
>>> for the regular LTO process.
>>
>> Hi Alex,
>>
>> It's true that the LLVM versions of these tools support bitcode
>> transparently, but not all build systems use LLVM versions of these
>> tools, particularly build systems that support a variety of compilers,
>> or legacy build systems. And not all build systems have the plugin or
>> currently pass it to the native tools that can take a plugin for
>> handling bitcode. In those cases the bitcode support is not
>> transparently available, and our aim is to reduce the friction as much
>> as possible. And not all use LTO currently (I know we don't due to the
>> scalability issues we're trying to address with this design), and in
>> those cases the migration to bitcode-aware tools and plugins was not
>> previously required.
>>
>> For Sony's linker, are you using the gold plugin or libLTO interfaces?
>> If the latter, I suppose some ThinLTO handling would have to be added
>> to your linker (e.g. to invoke the LLVM hooks to write the stage-2
>> combined function map and either launch the backend processes in
>> parallel or write out a make or other build file). The current support
>> for reading native object wrapped bitcode is baked into IRObjectFile
>> so presumably the Sony linker can handle these native object wrapped
>> bitcode files if it uses libLTO. We would similarly embed the handling
>> of the function index/summary behind an API that can handle either so
>> it is similarly transparent to the linkers. Let me know if there would
>> be additional issues that make wrapped bitcode more difficult in your
>> case, or how we could make ThinLTO usage simpler for you in general.
>>
>>>
>>> The only tool in the list of tools you mentioned that does not
>>> support bitcode directly is objcopy, and that's because nobody has
>>> yet written an LLVM-project implementation of it. Personally, I'd
>>> much rather you focus on making ThinLTO work by extending bitcode as
>>> needed, and we work as a community toward replacing objcopy with an
>>> LLVM-native one. It's a big missing piece of the LLVM project today
>>> and could be so much better if we could use it to replace Apple's
>>> lipo and possibly other extant object file modification tools. (Has
>>> anyone surveyed this area?)
>>>
>>> That older toolchains have tried to slip non-object file data
>>> through the binary utilities isn't really proof that this is a good
>>> choice. It might simply reflect the realities of those engineering
>>> teams. I wasn't at Sun for this, but DTrace needed a linker feature
>>> that apparently the Sun linker team was unwilling or unable to
>>> provide, so dtrace(1) gained the ability to modify ELF files
>>> directly as needed. That doesn't prove that DTrace's USDT feature
>>> shouldn't have been implemented in the linker (as ld64 does directly
>>> for Apple), does it?
>>
>> I'd argue that the realities being addressed by using native object
>> format in those cases still exist.
>>
>>>
>>> If in the end using native object-wrapped bitcode is the best
>>> solution, so be it. However, I think it is largely orthogonal to
>>> ThinLTO's needs for transporting symtab data alongside the existing
>>> bitcode format.
>>
>> That's certainly true, ThinLTO can be implemented using either format,
>> and bitcode only support can certainly be implemented. It is a matter
>> of prioritizing which format to implement first. I had added some
>> description to the updated RFC on how the function index/summary can
>> be represented, etc in bitcode. Prioritizing the native object format
>> doesn't make it easier to implement ThinLTO, but should make it easier
>> to deploy.
>>
>> Thanks!
>> Teresa
>>
>>>
>>> Alex
>>>
>>>> On May 28, 2015, at 2:10 PM, Teresa Johnson <tejohnson at google.com> wrote:
>>>>
>>>> As promised, here is a new version of the ThinLTO RFC, updated based
>>>> on some of the comments, questions and feedback from the first RFC.
>>>> Hopefully we have addressed many of these, and as noted below, will
>>>> fork some of the detailed discussion on particular aspects into
>>>> separate design doc threads. Please send any additional feedback and
>>>> questions on the overall design.
>>>> Thanks!
>>>> Teresa
>>>>
>>>> Updated RFC to discuss plans for implementing ThinLTO upstream,
>>>> reflecting feedback and discussion from initial RFC
>>>> (http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-May/085557.html). As
>>>> discussed in the earlier thread and below, more detailed design
>>>> documents for several pieces (native object format, linkage type
>>>> changes and static promotions, etc) are in progress and will be sent
>>>> separately. This RFC covers the overall design and the breakdown of
>>>> work at a higher level.
>>>>
>>>> Background on ThinLTO can be found in slides from EuroLLVM 2015:
>>>> https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0
>>>> As described in the talk, we have a prototype implementation, and
>>>> would like to start staging patches upstream. This RFC describes a
>>>> breakdown of the major pieces. We would like to commit upstream
>>>> gradually in several stages, with all functionality off by default.
>>>> The core ThinLTO importing support and tuning will require frequent
>>>> change and iteration during testing and tuning, and for that part we
>>>> would like to commit rapidly (off by default). See the proposed staged
>>>> implementation described in the Implementation Plan section.
>>>>
>>>> ThinLTO Overview
>>>> ================
>>>>
>>>> See the talk slides linked above for more details. The following is a
>>>> high-level overview of the motivation.
>>>>
>>>> Cross Module Optimization (CMO) is an effective means for improving
>>>> runtime performance, by extending the scope of optimizations across
>>>> source module boundaries. Without CMO, the compiler is limited to
>>>> optimizing within the scope of single source modules. Two solutions
>>>> for enabling CMO are Link-Time Optimization (LTO), which is currently
>>>> supported in LLVM and GCC, and Lightweight-Interprocedural
>>>> Optimization (LIPO). However, each of these solutions has limitations
>>>> that prevent it from being enabled by default. ThinLTO is a new
>>>> approach that attempts to address these limitations, with a goal of
>>>> being enabled more broadly. ThinLTO is designed with many of the same
>>>> principles as LIPO, and therefore its advantages, without any of its
>>>> inherent weaknesses. Unlike in LIPO where the module group decision is
>>>> made at profile training runtime, ThinLTO makes the decision at
>>>> compile time, but in a lazy mode that facilitates large scale
>>>> parallelism. LTO implementations all contain a serial IPA/IPO step
>>>> that is both memory intensive and slow, limiting usability on both
>>>> smaller workstations and huge applications. In contrast, the ThinLTO
>>>> serial linker plugin phase is designed to be razor thin and blazingly
>>>> fast. By default this step only does minimal preparation work to
>>>> enable the parallel lazy importing performed later. ThinLTO aims to be
>>>> scalable like a regular O2 build, enabling CMO on machines without
>>>> large memory configurations, while also integrating well with
>>>> distributed build systems. Results from early prototyping on SPEC
>>>> cpu2006 C++ benchmarks are in line with expectations that ThinLTO can
>>>> scale like O2 while enabling much of the CMO performed during a full
>>>> LTO build.
>>>>
>>>> A ThinLTO build is divided into 3 phases, which are referred to in the
>>>> following implementation plan:
>>>> 1. phase-1: IR and Function Summary Generation (-c compile)
>>>> 2. phase-2: Thin Linker Plugin Layer (thin archive linker step)
>>>> 3. phase-3: Parallel Backend with Demand-Driven Importing
>>>>
>>>> Implementation Plan
>>>> ===================
>>>>
>>>> This section gives a high-level breakdown of the ThinLTO support that
>>>> will be added, in roughly the order that the patches would be staged.
>>>> The patches are divided into three stages. The first stage contains a
>>>> minimal amount of preparation work that is not ThinLTO-specific. The
>>>> second stage contains most of the infrastructure for ThinLTO, which
>>>> will be off by default. The third stage includes
>>>> enhancements/improvements/tunings that can be performed after the main
>>>> ThinLTO infrastructure is in.
>>>>
>>>> The second and third implementation stages will initially be very
>>>> volatile, requiring a lot of iterations and tuning with large apps to
>>>> get stabilized. Therefore it will be important to do fast commits for
>>>> these implementation stages.
>>>>
>>>> 1. Stage 1: Preparation
>>>> -----------------------
>>>>
>>>> The first planned sets of patches are enablers for ThinLTO work:
>>>>
>>>> a. LTO directory structure
>>>>
>>>> Restructure the LTO directory to remove the circular dependence that
>>>> arises when the ThinLTO pass is added. Because ThinLTO is being
>>>> implemented as an SCC pass within Transforms/IPO, and leverages the
>>>> LTOModule class for linking in functions from modules, IPO then
>>>> requires the LTO library. This creates a circular dependence between
>>>> LTO and IPO. To break that, we need to split the lib/LTO
>>>> directory/library into lib/LTO/CodeGen and lib/LTO/Module, containing
>>>> LTOCodeGenerator and LTOModule, respectively. Only LTOCodeGenerator
>>>> has a dependence on IPO, removing the circular dependence.
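To make the dependence argument in 1a concrete, here is a small Python sketch that checks a library dependency graph for cycles before and after the proposed split. The graph edges are only the ones stated in the RFC text; the helper name find_cycle is made up for illustration and is not part of LLVM or its build system.

```python
from typing import Dict, List, Optional, Set


def find_cycle(deps: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of library names, or None."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the stack, finished
    color = {name: WHITE for name in deps}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        stack.append(node)
        for dep in sorted(deps.get(node, ())):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: we closed a loop.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for name in sorted(deps):
        if color[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None


# Before the split: LTOCodeGenerator (in LTO) needs IPO, and the ThinLTO
# pass (in IPO) needs LTOModule (also in LTO).
before = {"LTO": {"IPO"}, "IPO": {"LTO"}}

# After the split described in 1a: only LTO/CodeGen depends on IPO.
after = {
    "LTO/CodeGen": {"IPO"},
    "LTO/Module": set(),
    "IPO": {"LTO/Module"},
}

print(find_cycle(before))
print(find_cycle(after))
```

With the single lib/LTO library the check reports the IPO/LTO loop; with the two-way split it reports none, which is the property the restructuring is after.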
>>>>
>>>> Note that libLTO and llvm-lto use LTOModule/LTOCodeGenerator, whereas
>>>> the gold plugin uses lib/Object/IRObject and lib/Linker directly. The
>>>> use of LTOModule in the ThinLTO pass is a convenience, but could be
>>>> avoided by using the IRObject/Linker methods directly if that is
>>>> preferred.
>>>>
>>>> b. Native object wrapper generation support
>>>>
>>>> Implement native-object wrapped bitcode writer. The main goal is to
>>>> more easily interact with existing native tools such as $AR, $NM, “$LD
>>>> -r”, $OBJCOPY, and $RANLIB, without requiring the build system to find
>>>> and pass the plugin as an option. We plan to emit the phase-1 bitcode
>>>> wrapped in native object format via the .llvmbc section, along with a
>>>> symbol table. We will implement ELF first, but subsequently extend
>>>> support to COFF and Mach-O. Additionally, we also want to avoid doing
>>>> partial LTO/ThinLTO across files linked with “$LD -r” (i.e. the
>>>> resulting object file should still contain native object-wrapped
>>>> bitcode to enable ThinLTO at the full link step). I will send a
>>>> separate design document for these changes, including the format of
>>>> the symtab and function index/summary section, but the following is a
>>>> high-level motivation and overview.
>>>>
>>>> Note that support for ThinLTO using bitcode can be added as a
>>>> follow-on under an option, so that bitcode-aware tools do not need to
>>>> use the wrapper. Under the bitcode-only option, the symbol table will
>>>> be replaced by the bitcode form of the function index and summary
>>>> section, which can be encoded as a new bitcode block type. Changes
>>>> should be made to the gold plugin to avoid partial link of bitcode
>>>> files under “$LD -r” (emitting bitcode rather than compiling all the
>>>> way down to native code, which is how ld64 behaves on Darwin as per
>>>> dexonsmith).
>>>>
>>>> Advantages of using native object format:
>>>> * Out of the box interoperability with existing native build tools
>>>> ($AR, $NM, “$LD -r”, $OBJCOPY, and $RANLIB) which may not currently
>>>> know how to locate/pass the appropriate plugin.
>>>> * There is precedent for using this format: other compilers also wrap
>>>> intermediate LTO files (probably related to the above advantage)[1].
>>>> * Tools that modify symbol linkage and visibility (e.g. $OBJCOPY and
>>>> “$LD -r”) can mark the change in the symbol table without needing to
>>>> parse/change/encode bitcode. The change can be propagated to bitcode
>>>> by the ThinLTO backend.
>>>> * Some tools only need to read/write the symtab and can avoid
>>>> parsing/encoding bitcode (e.g. $NM, $OBJCOPY).
>>>> * The second phase of ThinLTO does not need to parse the bitcode when
>>>> creating the combined function index.
>>>>
>>>> Disadvantages of using native object format:
>>>> * Unnecessary when using plugins with plugin-aware native tools, or
>>>> LLVM’s custom tools.
>>>> * Slightly increased disk storage and I/O from the symtab. However,
>>>> with our design the symtab is leveraged to hold function indexing info
>>>> required for ThinLTO. The I/O for some build tools and build steps can
>>>> actually be reduced as there is no need to read the bitcode, as
>>>> described above.
>>>>
>>>> Support was added to LLVM for reading native object-wrapped bitcode
>>>> (http://reviews.llvm.org/rL218078), but there does not yet exist
>>>> support in LLVM/Clang for emitting bitcode wrapped in native object
>>>> format. I plan to add support for optionally generating bitcode in a
>>>> native object file containing a single .llvmbc section holding the
>>>> bitcode. Specifically, the patch would add new options
>>>> “emit-llvm-native-object” (object file) and corresponding
>>>> “emit-llvm-native-assembly” (textual assembly code equivalent).
>>>> Eventually these would be automatically triggered under “-fthinlto -c”
>>>> and “-fthinlto -S”, respectively.
>>>>
>>>> Additionally, a symbol table will be generated in the native object
>>>> file, holding the function symbols within the bitcode. This
>>>> facilitates handling archives of the native object-wrapped bitcode
>>>> created with $AR, since the archive will have a symbol table as well.
>>>> The archive symbol table enables gold to extract and pass to the
>>>> plugin the constituent native object-wrapped bitcode files. To support
>>>> the concatenated llvmbc section generated by “$LD -r”, some handling
>>>> needs to be added to gold and to the backend driver to process each
>>>> original module’s bitcode.
>>>>
>>>> The function index/summary will later be added as a special native
>>>> object section alongside the .llvmbc sections. The offset and size of
>>>> the corresponding function summary can be placed in the associated
>>>> symtab entry. As noted above, a separate design document will be sent
>>>> for the native object format changes.
>>>>
>>>> 2. Stage 2: ThinLTO Infrastructure
>>>> ----------------------------------
>>>>
>>>> The next set of patches adds the base implementation of the ThinLTO
>>>> infrastructure, specifically those required to make ThinLTO functional
>>>> and generate correct but not necessarily high-performing binaries.
>>>>
>>>> a. Clang/LLVM/gold linker options
>>>>
>>>> An early set of clang/llvm patches is needed to provide options to
>>>> enable ThinLTO (off by default), so that the rest of the
>>>> implementation can be disabled by default as it is added.
>>>> Specifically, clang options -fthinlto (used instead of -flto) will
>>>> cause clang to invoke the phase-1 emission of LLVM bitcode and
>>>> function summary/index on a compile step, and pass the appropriate
>>>> option to the gold plugin on a link step.
>>>> The -thinlto option will be added to the gold plugin and llvm-lto
>>>> tool to launch the phase-2 thin archive step. The -thinlto-be option
>>>> will also be added to clang to invoke it as a phase-3 parallel
>>>> backend instance with a bitcode file as input.
>>>>
>>>> b. Thin-archive linking support in Gold plugin and llvm-lto
>>>>
>>>> Under the new plugin option (see above), the plugin needs to perform
>>>> the phase-2 (thin archive) link which simply emits a combined function
>>>> index from the linked modules, without actually performing the normal
>>>> link. Corresponding support should be added to the standalone llvm-lto
>>>> tool to enable testing/debugging without involving the linker and
>>>> plugin.
>>>>
>>>> c. ThinLTO backend support
>>>>
>>>> Support for invoking a phase-3 backend invocation (including
>>>> importing) on a module should be added to the clang driver under the
>>>> new option. The main change under the option is to instantiate a
>>>> Linker object used to manage the process of linking imported functions
>>>> into the module, efficiently read the combined function index, and
>>>> enable the ThinLTO import pass.
>>>>
>>>> d. Function index/summary support
>>>>
>>>> This includes infrastructure for writing and reading the function
>>>> index/summary section. As noted earlier this will be encoded in a
>>>> special section within the native object file for the module,
>>>> alongside the .llvmbc section containing the bitcode. The thin archive
>>>> (combined function index) generated by phase-2 of ThinLTO simply
>>>> contains all of the function index/summary sections across the linked
>>>> modules, organized for efficient function lookup. As mentioned earlier
>>>> when discussing the native object wrapper format, a separate design
>>>> document will be sent for this format.
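As a rough illustration of the phase-2 step in 2d, the following Python sketch merges per-module function index/summary sections into one combined index keyed by function name. Everything concrete here is an assumption made for the example: the names ModuleIndex, FunctionEntry and build_combined_index, the choice of fields (bitcode offset and instruction count), and the policy of keeping the first definition seen. The actual section format is deferred to a separate design document.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class FunctionEntry:
    module: str          # module that holds the function body
    bitcode_offset: int  # hypothetical: where the body starts in that module
    inst_count: int      # hypothetical summary field used by heuristics


@dataclass
class ModuleIndex:
    module: str
    # function name -> (bitcode offset, instruction count)
    functions: Dict[str, Tuple[int, int]]


def build_combined_index(
    module_indexes: Iterable[ModuleIndex],
) -> Dict[str, FunctionEntry]:
    """Concatenate per-module sections into one lookup table.

    No bitcode is parsed here; only the small index sections are read.
    """
    combined: Dict[str, FunctionEntry] = {}
    for index in module_indexes:
        for name, (offset, inst_count) in index.functions.items():
            # Assumption: keep the first copy of a duplicated function.
            combined.setdefault(
                name, FunctionEntry(index.module, offset, inst_count)
            )
    return combined


a = ModuleIndex("a.o", {"foo": (128, 12)})
b = ModuleIndex("b.o", {"bar": (64, 40), "foo": (256, 12)})
combined = build_combined_index([a, b])
print(combined["bar"])
```

A backend process compiling some other module could then look up `combined["bar"]` to learn which file to open and where to seek, without touching the rest of b.o.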
>>>>
>>>> Each function available for importing from the module contains an
>>>> entry in the module’s function index/summary section and in the
>>>> resulting combined function index. Each function entry contains that
>>>> function’s offset within the bitcode file, used to efficiently locate
>>>> and quickly import just that function (see below in 2e for more
>>>> details on the importing mechanics). The entry also contains summary
>>>> information (e.g. basic information determined during parsing such as
>>>> the number of instructions in the function), that will be used to help
>>>> guide later import decisions. Because the contents of this section
>>>> will change frequently during ThinLTO tuning, it should also be marked
>>>> with a version id for backwards compatibility or version checking.
>>>>
>>>> e. ThinLTO importing support
>>>>
>>>> Support for the mechanics of importing functions from other modules,
>>>> which can go in gradually as a set of patches since it will be off by
>>>> default (the ThinLTO pass itself discussed below in 2f).
>>>>
>>>> Note that ThinLTO function importing is iterative, and we may import
>>>> from a number of modules in an interleaved fashion. For example,
>>>> assume we have hot call chains a()->b1()->c() and a()->b2()->d(),
>>>> where functions a(), b1()/b2(), c() and d() are from modules A, B, C
>>>> and D, respectively. When performing ThinLTO backend compilation of
>>>> module A, we may decide to import in the following order (based on
>>>> callsite and function summary info):
>>>> 1. B::b1() # exposes call to c()
>>>> 2. C::c()
>>>> 3. B::b2() # exposes call to d()
>>>> 4. D::d()
>>>> For this reason, ThinLTO importing is different than regular LTO
>>>> bitcode reading and linking, which reads and links in a module in its
>>>> entirety on a single pass through each module (notice in the above
>>>> example the imports of the two module B functions have an intervening
>>>> import from module C).
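The interleaved order in the example above can be reproduced with a small Python sketch. It assumes the simplest possible policy, namely that every callee reachable along a call chain is imported as soon as its call site is exposed; the real decision would depend on callsite and function summary information. CALL_GRAPH and import_order are hypothetical names used only for this illustration.

```python
from typing import Dict, List, Tuple

# function name -> (defining module, callees in call-site order)
CALL_GRAPH: Dict[str, Tuple[str, List[str]]] = {
    "a": ("A", ["b1", "b2"]),
    "b1": ("B", ["c"]),
    "b2": ("B", ["d"]),
    "c": ("C", []),
    "d": ("D", []),
}


def import_order(root: str, graph: Dict[str, Tuple[str, List[str]]]) -> List[str]:
    """List the functions imported while compiling the module of `root`."""
    home_module = graph[root][0]
    imported: List[str] = []
    seen = {root}
    # Callees are pushed in reverse so the first call site is handled first.
    stack = list(reversed(graph[root][1]))
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        module, callees = graph[fn]
        if module != home_module:
            imported.append(f"{module}::{fn}()")
        # Importing fn exposes its own call sites, which are followed
        # before returning to the remaining call sites of the caller.
        stack.extend(reversed(callees))
    return imported


order = import_order("a", CALL_GRAPH)
print(order)
```

The resulting list visits module B twice with module C in between, which is why a single pass per module does not fit this importing scheme.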
>>>> As a result, for example, the existing support
>>>> for lazy metadata parsing that delays it until the first function is
>>>> materialized can’t be leveraged (metadata handling is discussed more
>>>> below in 2h). Therefore, the ThinLTO importing pass instantiates a new
>>>> BitcodeReader and LTOModule object for each function we decide to
>>>> import, parsing only what is needed and linking in just that function.
>>>> This is fast and efficient as found in the prototype results shown in
>>>> the linked EuroLLVM slides.
>>>>
>>>> Separate patches can include:
>>>>
>>>> * BitcodeReader changes to use function index to import/deserialize
>>>> single function of interest (small changes, leverages existing lazy
>>>> function streamer support). The declarations and other symbol table
>>>> info in the bitcode must be reloaded, but the bitcode parsing can stop
>>>> once the first function body is hit. We simply set up an entry in the
>>>> lazy streamer’s DeferredFunctionInfo function index map from the
>>>> bitcode index that was saved in the ThinLTO function summary (and
>>>> therefore don’t need to build up this function index structure through
>>>> repeated calls to RememberAndSkipFunctionBody via
>>>> FindFunctionInStream).
>>>> * Minor LTOModule changes to pass the ThinLTO function to import and
>>>> its index into bitcode reader (see 1a for discussion on LTOModule
>>>> use).
>>>> * Marking of imported functions. Most handling for ThinLTO imported
>>>> functions will simply rely on applying the appropriate linkage type.
>>>> But it is useful to know which functions were imported, both for
>>>> compiler debugging and verification, and possibly to modify some
>>>> optimization heuristics along with the summary information. This can
>>>> be in-memory initially, but IR support may be required in order to
>>>> support streaming bitcode out and back in again after importing.
>>>> * ModuleLinker changes to do ThinLTO-specific symbol linking and
>>>> static promotion when necessary. The linkage type of imported
>>>> non-local functions and variables changes to
>>>> AvailableExternallyLinkage, for example. Statics must be promoted in
>>>> certain cases, and accordingly renamed in consistent ways. Read-write
>>>> or address-taken static variables must always be promoted. Other
>>>> discardable functions, i.e. link-once such as comdats, will be force
>>>> imported on reference by another imported function. We are working on
>>>> a separate design document describing these changes in more detail
>>>> with examples, as a more detailed discussion of these changes is
>>>> beyond the scope of this RFC.
>>>> * GlobalDCE changes to support removing imported non-local functions
>>>> that were not inlined and imported non-local variables, which are
>>>> marked AvailableExternallyLinkage (very small changes to existing pass
>>>> logic). As discussed in the original RFC threads, currently GlobalDCE
>>>> does not remove referenced AvailableExternallyLinkage functions.
>>>> Instead, these are suppressed later during code generation. It isn’t
>>>> clear that these functions are useful past the first call to
>>>> GlobalDCE, which is after inlining, GlobalOpt and IPSCCP (so
>>>> presumably after inter procedural constant prop, etc). A patch with
>>>> these changes is in testing as discussed in this thread:
>>>> http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-May/085807.html.
>>>>
>>>> f. ThinLTO Import Driver SCC pass
>>>>
>>>> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
>>>> an SCC pass, enabled only under the -fthinlto-be option. The pass
>>>> includes utilizing the thin archive[2] (combined global function
>>>> index/summary), import decision heuristics, invocation of
>>>> LTOModule/ModuleLinker routines that perform the import, and any
>>>> necessary callgraph updates and verification.
>>>>
>>>> g. Backend Driver
>>>>
>>>> For a single node build, the gold plugin will initially exec the
>>>> backend processes directly, with the amount of parallelism controlled
>>>> via an option and/or env variable. It is also possible to leverage
>>>> existing single node build system task dispatching mechanisms such as
>>>> Unix Makefiles, Ninja, etc., where the plugin can simply write a build
>>>> file and fork the parallel backend instances directly under an
>>>> appropriate option. We will also initially add support for our
>>>> distributed build system as described below under 3c.
>>>>
>>>> h. Lazy Debug Metadata Linking
>>>>
>>>> The prototype implementation included lazy importing of module-level
>>>> metadata during the ThinLTO pass finalization (i.e. after all function
>>>> importing is complete). This actually applies to all module-level
>>>> metadata, not just debug, although it is the largest. This can be
>>>> added as a separate set of patches, and the detailed design will be
>>>> sent with those. Includes changes to BitcodeReader, ValueMapper, and
>>>> the ModuleLinker classes. As described in 2e, due to the
>>>> iterative/interleaved nature of ThinLTO importing, the bitcode parsing
>>>> is structured differently than LTO where a single pass over each
>>>> module can be performed to parse and materialize all functions and
>>>> metadata. Therefore, the lazy metadata parsing support in
>>>> BitcodeReader, which parses all the metadata once the first function
>>>> is materialized, is not applicable. We may instantiate a
>>>> BitcodeReader multiple times for a module, if multiple functions are
>>>> eventually imported, and we need a way to suture up the metadata to
>>>> the functions imported by an earlier BitcodeReader instantiation.
>>>> The high level summary is that during the initial import we leave the
>>>> temporary metadata on the instructions that were imported, but save
>>>> the index that the bitcode reader uses to correlate with the
>>>> metadata when it is ready (i.e. the MDValuePtrs index), and skip the
>>>> metadata parsing. During the ThinLTO pass finalization we parse just
>>>> the metadata, and suture it up during metadata value mapping using the
>>>> saved index. As mentioned earlier, this will be described in more
>>>> detail when the patches are ready.
>>>>
>>>> 3. Stage 3: ThinLTO Tuning and Enhancements
>>>> -------------------------------------------
>>>>
>>>> This refers to the patches that are not required for ThinLTO to work,
>>>> but rather to improve compile time, memory, run-time performance and
>>>> usability.
>>>>
>>>> a. Import Tuning
>>>>
>>>> Tuning the import strategy will be an iterative process that will
>>>> continue to be refined over time. It involves several different types
>>>> of changes: adding support for recording additional metrics in the
>>>> function summary, such as profile data and optional heavier-weight IPA
>>>> analyses, and tuning the import heuristics based on the summary and
>>>> callsite context.
>>>>
>>>> b. Combined Function Index Pruning
>>>>
>>>> The combined function index can be pruned of functions that are
>>>> unlikely to benefit from being imported. For example, during the
>>>> phase-2 thin archive plugin step we can safely omit large and (with
>>>> profile data) cold functions, which are unlikely to benefit from being
>>>> inlined. Additionally, all but one copy of comdat functions can be
>>>> suppressed.
>>>>
>>>> c. Distributed Build System Integration
>>>>
>>>> For a distributed build system such as Bazel (http://bazel.io/), the
>>>> gold plugin should write the parallel backend invocations into a build
>>>> file, including the mapping from the IR file to the real object file
>>>> path, and exit. Additional work needs to be done in the distributed
>>>> build system itself to distribute and dispatch the parallel backend
>>>> jobs to the build cluster.
>>>>
>>>> d. Dependence Tracking and Incremental Compiles
>>>>
>>>> In order to support build systems that stage from local disks or
>>>> network storage, the plugin will optionally support computation of
>>>> dependent sets of IR files that each module may import from. This can
>>>> be computed from profile data, if it exists, or from the symbol table
>>>> and heuristics if not. These dependence sets also enable support for
>>>> incremental backend compiles.
>>>>
>>>> ________________
>>>> [1] The following compilers currently wrap intermediate LTO files in
>>>> native object format: GCC fat and non-fat objects (with a custom
>>>> symtab), Intel icc non-fat (IR-only) objects (with a full native
>>>> symtab), HP’s aCC non-fat objects (with full native symtab), IBM xlC
>>>> both fat and non-fat objects (with full native symtab).
>>>> [2] The “thin archive” here (also referred to as a combined function
>>>> index) has some similarities to the AR tool thin archive format, but
>>>> is not exactly the same. Both contain the symtab and not the code, but
>>>> the ThinLTO combined function index contains the summary sections as
>>>> well.
-- Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
> -----Original Message----- > From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On > Behalf Of Teresa Johnson > Sent: Wednesday, June 03, 2015 7:02 AM > To: Dave Bozier > Cc: <llvmdev at cs.uiuc.edu> List > Subject: Re: [LLVMdev] Updated RFC: ThinLTO Implementation Plan > > On Wed, Jun 3, 2015 at 4:19 AM, Dave Bozier <seifsta at gmail.com> wrote: > > Hi Teresa, > > > > Thanks for providing this updated RFC. > > > >> For Sony's linker, are you using the gold plugin or libLTO interfaces? > >> If the latter, I suppose some ThinLTO handling would have to be added > >> to your linker (e.g. to invoke the LLVM hooks to write the stage-2 > >> combined function map and either launch the backend processes in > >> parallel or write out a make or other build file). The current support > >> for reading native object wrapped bitcode is baked into IRObjectFile > >> so presumably the Sony linker can handle these native object wrapped > >> bitcode files if it uses libLTO. We would similarly embed the handling > >> of the function index/summary behind an API that can handle either so > >> it is similarly transparent to the linkers. Let me know if there would > >> be additional issues that make wrapped bitcode more difficult in your > >> case, or how we could make ThinLTO usage simpler for you in general. > > We use the libLTO interfaces. > > Hi Dave, > > Thanks for the info. > > > > > We use the libLTO interfaces, more specifically we use the C API > > located in llvm-c\lto.h. > > > > Our linker won't support native object wrapped bitcode files as our > > LTO is it currently stands. Right now, it will be recognized as an > > object file and won't get anywhere near the libLTO libraries. We'd > > need to teach our linker to recognize and differentiate native object > > wrapped bitcode files and regular native object files. 
This isn't > > straight forward as we cannot distinguish them just by looking at the > > file header alone, we would need to parse the sections and look for a > > .llvmbc section. We then need to add special handling of these native > > object wrappers. > > Ok, I see. Does it help that there are LTOModule (lto_module_* in the > C API) interfaces for checking if a file contains bitcode (regardless > of whether it is straight-up or native-wrapped)? I don't know how hard > in your linker it is to query these when deciding whether to treat the > object file as bitcode or not, or how hard it is to pass the resulting > object file along to the libLTO routines for handling (they > automatically handle the native-wrapped object files so the linker > shouldn't have to do anything special to read them).

One twist is that we use the Darwin-style wrapper around our bitcode files so that we have a place to hang a bitcode version number, which we also want to check. Without reopening the debate about why we do that, we do that, and I fully expect the libLTO API to silently ignore the wrapper that we are depending on. I suppose we could add a new libLTO API that verifies the bitcode wrapper but it would be yet another private change to maintain, rather than just having the linker check it directly.
--paulr

> > Specifically, in the C API these are the lto_module_is_object_file* > variants, which will return true for either straight-up or > native-wrapped bitcode. All of the mechanics of handling bitcode vs > native object-wrapped bitcode are down in the IRObjectFile handling. > So the LTOModule:isBitcode*/lto_module_is_object_file* will correctly > identify native object-wrapped bitcode as bitcode. And the > LTOModule::createFrom*/lto_module_create* routines correctly parse the > native object-wrapped bitcode and return an LTOmodule. 
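A linker-side check of the kind paulr describes might look like the sketch below. It assumes the wrapper layout given in the LLVM bitcode format documentation (five little-endian 32-bit fields: magic 0x0B17C0DE, version, offset, size, CPU type) and assumes that the version to verify is stored in the header's version field. The function name and the expected_version parameter are hypothetical, and nothing here is taken from Sony's linker or from libLTO.

```python
import struct

# Magic values from the LLVM bitcode format documentation.
WRAPPER_MAGIC = 0x0B17C0DE          # first field of the wrapper header
RAW_BITCODE_MAGIC = b"BC\xc0\xde"   # first four bytes of a raw bitcode stream
WRAPPER_HEADER_SIZE = 20            # five 32-bit fields


def check_wrapped_bitcode(data, expected_version):
    """Hypothetical helper: return (ok, reason) for a wrapped bitcode buffer.

    Assumption: the producer records its bitcode version in the wrapper's
    version field, which is what the caller passes as expected_version.
    """
    if len(data) < WRAPPER_HEADER_SIZE:
        return False, "too short for a wrapper header"
    # Fields are little-endian: magic, version, offset, size, cputype.
    magic, version, offset, size, _cputype = struct.unpack_from("<5I", data, 0)
    if magic != WRAPPER_MAGIC:
        return False, "no wrapper header"
    if version != expected_version:
        return False, "version mismatch"
    if offset + size > len(data):
        return False, "payload out of bounds"
    if data[offset:offset + 4] != RAW_BITCODE_MAGIC:
        return False, "payload is not bitcode"
    return True, "ok"
```

Doing the check on the raw bytes keeps it in the linker, before the buffer is handed to libLTO, which matches the preference stated above for not carrying another private libLTO change.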
> > As a result, the llvm-lto tool that also uses libLTO interfaces didn't > require any changes when the native-wrapped reading support went in > (r218078), and is able to handle native-wrapped bitcode out of the > box. > > > > > Handling the function index/summary behind an API sounds like a good > idea. > > I am going to work on fleshing out this part next so that the actual > format of the files is hidden from clients. > > Thanks, > Teresa > > > > > On Fri, May 29, 2015 at 4:01 PM, Teresa Johnson <tejohnson at google.com> > wrote: > >> On Fri, May 29, 2015 at 6:56 AM, Alex Rosenberg <alexr at leftfield.org> > wrote: > >>> My earlier statement about wrapping things in a native object file > held in that it is controversial. It appears to be still central to your > design. > >>> > >>> It may help to look at the problem from a different viewpoint: LLVM is > not a compiler. It is a framework that can be used to make compiler-like > tools. > >>> > >>> From that view, it no longer makes sense to discuss "the plugin," or > gold, or $AR, because there isn't just one of any of those things. ld64 > isn't the only outlier linker to consider. We have our own linker at Sony, > for example. From this perspective, then it makes more sense to consider > replacing the binary utilities with ones that support bitcode, because > from a user-perspective, all of the linkers already transparently support > bitcode directly today, as do ar, nm, etc. This has been necessary for the > regular LTO process. > >> > >> Hi Alex, > >> > >> It's true that the LLVM versions of these tools support bitcode > >> transparently, but not all build systems use LLVM versions of these > >> tools, particularly build systems that support a variety of compilers, > >> or legacy build systems. And not all build systems have the plugin or > >> currently pass it to the native tools that can take a plugin for > >> handling bitcode. 
In those cases the bitcode support is not > >> transparently available, and our aim is to reduce the friction as much > >> as possible. And not all use LTO currently (I know we don't due to the > >> scalability issues we're trying to address with this design), and in > >> those cases the migration to bitcode-aware tools and plugins was not > >> previously required. > >> > >> For Sony's linker, are you using the gold plugin or libLTO interfaces? > >> If the latter, I suppose some ThinLTO handling would have to be added > >> to your linker (e.g. to invoke the LLVM hooks to write the stage-2 > >> combined function map and either launch the backend processes in > >> parallel or write out a make or other build file). The current support > >> for reading native object wrapped bitcode is baked into IRObjectFile > >> so presumably the Sony linker can handle these native object wrapped > >> bitcode files if it uses libLTO. We would similarly embed the handling > >> of the function index/summary behind an API that can handle either so > >> it is similarly transparent to the linkers. Let me know if there would > >> be additional issues that make wrapped bitcode more difficult in your > >> case, or how we could make ThinLTO usage simpler for you in general. > >> > >>> > >>> The only tool in the list of tools you mentioned that do not support > bitcode directly is objcopy, and that's because nobody has yet written an > LLVM-project implementation of it. Personally, I'd much rather you focus > on making ThinLTO work by extending bitcode as needed, and we work as a > community toward replacing objcopy with an LLVM-native one. It's a big > missing piece of the LLVM project today and could be so much better if we > could use it to replace Apple's lipo and possibly other extant object file > modification tools. (Has anyone surveyed this area?) 
> >>> > >>> That older toolchains have tried to slip non-object file data through > the binary utilities isn't really proof that this is a good choice. It > might simply reflect the realities of those engineering teams. I wasn't at > Sun for this, but DTrace needed a linker feature that apparently the Sun > linker team was unwilling or unable to provide, so dtrace(1) gained the > ability to modify ELF files directly as needed. That doesn't prove that > DTrace's USDT feature shouldn't have been implemented in the linker (as > ld64 does directly for Apple), does it? > >> > >> I'd argue that the realities being addressed by using native object > >> format in those cases still exist. > >> > >>> > >>> If in the end using native object-wrapped bitcode is the best > solution, so be it. However, I think it is largely orthogonal to ThinLTO's > needs for transporting symtab data alongside the existing bitcode format. > >> > >> That's certainly true, ThinLTO can be implemented using either format, > >> and bitcode only support can certainly be implemented. It is a matter > >> of prioritizing which format to implement first. I had added some > >> description to the updated RFC on how the function index/summary can > >> be represented, etc in bitcode. Prioritizing the native object format > >> doesn't make it easier to implement ThinLTO, but should make it easier > >> to deploy. > >> > >> Thanks! > >> Teresa > >> > >>> > >>> Alex > >>> > >>>> On May 28, 2015, at 2:10 PM, Teresa Johnson <tejohnson at google.com> > wrote: > >>>> > >>>> As promised, here is an new version of the ThinLTO RFC, updated based > >>>> on some of the comments, questions and feedback from the first RFC. > >>>> Hopefully we have addressed many of these, and as noted below, will > >>>> fork some of the detailed discussion on particular aspects into > >>>> separate design doc threads. Please send any additional feedback and > >>>> questions on the overall design. > >>>> Thanks! 
> >>>> Teresa > >>>> > >>>> > >>>> Updated RFC to discuss plans for implementing ThinLTO upstream, > >>>> reflecting feedback and discussion from initial RFC > >>>> (http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-May/085557.html). As > >>>> discussed in the earlier thread and below, more detailed design > >>>> documents for several pieces (native object format, linkage type > >>>> changes and static promotions, etc) are in progress and will be sent > >>>> separately. This RFC covers the overall design and the breakdown of > >>>> work at a higher level. > >>>> > >>>> > >>>> Background on ThinLTO can be found in slides from EuroLLVM 2015: > >>>> > https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0 > >>>> As described in the talk, we have a prototype implementation, and > >>>> would like to start staging patches upstream. This RFC describes a > >>>> breakdown of the major pieces. We would like to commit upstream > >>>> gradually in several stages, with all functionality off by default. > >>>> The core ThinLTO importing support and tuning will require frequent > >>>> change and iteration during testing and tuning, and for that part we > >>>> would like to commit rapidly (off by default). See the proposed > staged > >>>> implementation described in the Implementation Plan section. > >>>> > >>>> > >>>> ThinLTO Overview > >>>> =================> >>>> > >>>> > >>>> See the talk slides linked above for more details. The following is a > >>>> high-level overview of the motivation. > >>>> > >>>> > >>>> Cross Module Optimization (CMO) is an effective means for improving > >>>> runtime performance, by extending the scope of optimizations across > >>>> source module boundaries. Without CMO, the compiler is limited to > >>>> optimizing within the scope of single source modules. 
Two solutions > >>>> for enabling CMO are Link-Time Optimization (LTO), which is currently > >>>> supported in LLVM and GCC, and Lightweight-Interprocedural > >>>> Optimization (LIPO). However, each of these solutions has limitations > >>>> that prevent it from being enabled by default. ThinLTO is a new > >>>> approach that attempts to address these limitations, with a goal of > >>>> being enabled more broadly. ThinLTO is designed with many of the same > >>>> principals as LIPO, and therefore its advantages, without any of its > >>>> inherent weakness. Unlike in LIPO where the module group decision is > >>>> made at profile training runtime, ThinLTO makes the decision at > >>>> compile time, but in a lazy mode that facilitates large scale > >>>> parallelism. LTO implementations all contain a serial IPA/IPO step > >>>> that is both memory intensive and slow, limiting usability on both > >>>> smaller workstations and huge applications. In contrast, the ThinLTO > >>>> serial linker plugin phase is designed to be razor thin and blazingly > >>>> fast. By default this step only does minimal preparation work to > >>>> enable the parallel lazy importing performed later. ThinLTO aims to > be > >>>> scalable like a regular O2 build, enabling CMO on machines without > >>>> large memory configurations, while also integrating well with > >>>> distributed build systems. Results from early prototyping on SPEC > >>>> cpu2006 C++ benchmarks are in line with expectations that ThinLTO can > >>>> scale like O2 while enabling much of the CMO performed during a full > >>>> LTO build. > >>>> > >>>> > >>>> A ThinLTO build is divided into 3 phases, which are referred to in > the > >>>> following implementation plan: > >>>> 1. phase-1: IR and Function Summary Generation (-c compile) > >>>> 2. phase-2: Thin Linker Plugin Layer (thin archive linker step) > >>>> 3. 
phase-3: Parallel Backend with Demand-Driven Importing > >>>> > >>>> > >>>> Implementation Plan > >>>> ===================> >>>> > >>>> > >>>> This section gives a high-level breakdown of the ThinLTO support that > >>>> will be added, in roughly the order that the patches would be staged. > >>>> The patches are divided into three stages. The first stage contains a > >>>> minimal amount of preparation work that is not ThinLTO-specific. The > >>>> second stage contains most of the infrastructure for ThinLTO, which > >>>> will be off by default. The third stage includes > >>>> enhancements/improvements/tunings that can be performed after the > main > >>>> ThinLTO infrastructure is in. > >>>> > >>>> > >>>> The second and third implementation stages will initially be very > >>>> volatile, requiring a lot of iterations and tuning with large apps to > >>>> get stabilized. Therefore it will be important to do fast commits for > >>>> these implementation stages. > >>>> > >>>> > >>>> 1. Stage 1: Preparation > >>>> ------------------------------------ > >>>> > >>>> > >>>> The first planned sets of patches are enablers for ThinLTO work: > >>>> > >>>> > >>>> a. LTO directory structure > >>>> > >>>> > >>>> Restructure the LTO directory to remove circular dependence when > >>>> ThinLTO pass added. Because ThinLTO is being implemented as a SCC > pass > >>>> within Transforms/IPO, and leverages the LTOModule class for linking > >>>> in functions from modules, IPO then requires the LTO library. This > >>>> creates a circular dependence between LTO and IPO. To break that, we > >>>> need to split the lib/LTO directory/library into lib/LTO/CodeGen and > >>>> lib/LTO/Module, containing LTOCodeGenerator and LTOModule, > >>>> respectively. Only LTOCodeGenerator has a dependence on IPO, removing > >>>> the circular dependence. > >>>> > >>>> > >>>> Note that libLTO and llvm-lto use LTOModule/LTOCodeGenerator, whereas > >>>> the gold plugin uses lib/Object/IRObject and lib/Linker directly. 
The > >>>> use of LTOModule in the ThinLTO pass is a convenience, but could be > >>>> avoided by using the IRObject/Linker methods directly if that is > >>>> preferred. > >>>> > >>>> > >>>> b. Native object wrapper generation support > >>>> > >>>> > >>>> Implement native-object wrapped bitcode writer. The main goal is to > >>>> more easily interact with existing native tools such as $AR, $NM, > “$LD > >>>> -r”, $OBJCOPY, and $RANLIB, without requiring the build system to > find > >>>> and pass the plugin as an option. We plan to emit the phase-1 bitcode > >>>> wrapped in native object format via the .llvmbc section, along with a > >>>> symbol table. We will implement ELF first, but subsequently extend > >>>> support to COFF and Mach-O. Additionally, we also want to avoid doing > >>>> partial LTO/ThinLTO across files linked with “$LD -r” (i.e. the > >>>> resulting object file should still contain native object-wrapped > >>>> bitcode to enable ThinLTO at the full link step). I will send a > >>>> separate design document for these changes, including the format of > >>>> the symtab and function index/summary section, but the following is a > >>>> high-level motivation and overview. > >>>> > >>>> > >>>> Note that support for ThinLTO using bitcode can be added as a > >>>> follow-on under an option, so that bitcode-aware tools do not need to > >>>> use the wrapper. Under the bitcode-only option, the symbol table will > >>>> be replaced by the bitcode form of the function index and summary > >>>> section, which can be encoded as a new bitcode block type. Changes > >>>> should be made to the gold plugin to avoid partial link of bitcode > >>>> files under “$LD -r” (emitting bitcode rather than compiling all the > >>>> way down to native code, which is how ld64 behaves on Darwin as per > >>>> dexonsmith). 
> >>>> > >>>> > >>>> Advantages of using native object format: > >>>> * Out of the box interoperability with existing native build tools > >>>> ($AR, $NM, “$LD -r”, $OBJCOPY, and $RANLIB) which may not currently > >>>> know how to locate/pass the appropriate plugin. > >>>> * There is precedence in using this format: other compilers also wrap > >>>> intermediate LTO files (probably related to the above advantage)[1]. > >>>> * Tools that modify symbol linkage and visibility (e.g. $OBJCOPY and > >>>> “$LD -r”) can mark the change in the symbol table without needing to > >>>> parse/change/encode bitcode. The change can be propagated to bitcode > >>>> by the ThinLTO backend. > >>>> * Some tools only need to read/write the symtab and can avoid > >>>> parsing/encoding bitcode (e.g. $NM, $OBJCOPY). > >>>> * The second phase of ThinLTO does not need to parse the bitcode when > >>>> creating the combined function index. > >>>> > >>>> > >>>> Disadvantages of using native object format: > >>>> * Unnecessary when using plugins with plugin-aware native tools, or > >>>> LLVM’s custom tools. > >>>> * Slightly increase disk storage and I/O from symtab. However, with > >>>> our design the symtab is leveraged to hold function indexing info > >>>> required for ThinLTO. The I/O for some build tools and build steps > can > >>>> actually be reduced as there is no need to read the bitcode, as > >>>> described above. > >>>> > >>>> > >>>> Support was added to LLVM for reading native object-wrapped bitcode > >>>> (http://reviews.llvm.org/rL218078), but there does not yet exist > >>>> support in LLVM/Clang for emitting bitcode wrapped in native object > >>>> format. I plan to add support for optionally generating bitcode in an > >>>> native object file containing a single .llvmbc section holding the > >>>> bitcode. 
Specifically, the patch would add new options > >>>> “emit-llvm-native-object” (object file) and corresponding > >>>> “emit-llvm-native-assembly” (textual assembly code equivalent). > >>>> Eventually these would be automatically triggered under “-fthinlto - > c” > >>>> and “-fthinlto -S”, respectively. > >>>> > >>>> > >>>> Additionally, a symbol table will be generated in the native object > >>>> file, holding the function symbols within the bitcode. This > >>>> facilitates handling archives of the native object-wrapped bitcode > >>>> created with $AR, since the archive will have a symbol table as well. > >>>> The archive symbol table enables gold to extract and pass to the > >>>> plugin the constituent native object-wrapped bitcode files. To > support > >>>> the concatenated llvmbc section generated by “$LD -r”, some handling > >>>> needs to be added to gold and to the backend driver to process each > >>>> original module’s bitcode. > >>>> > >>>> > >>>> The function index/summary will later be added as a special native > >>>> object section alongside the .llvmbc sections. The offset and size of > >>>> the corresponding function summary can be placed in the associated > >>>> symtab entry. As noted above, a separate design document will be sent > >>>> for the native object format changes. > >>>> > >>>> > >>>> 2. Stage 2: ThinLTO Infrastructure > >>>> ------------------------------------------------------ > >>>> > >>>> > >>>> The next set of patches adds the base implementation of the ThinLTO > >>>> infrastructure, specifically those required to make ThinLTO > functional > >>>> and generate correct but not necessarily high-performing binaries. > >>>> > >>>> > >>>> a. Clang/LLVM/gold linker options > >>>> > >>>> > >>>> An early set of clang/llvm patches is needed to provide options to > >>>> enable ThinLTO (off by default), so that the rest of the > >>>> implementation can be disabled by default as it is added. 
> >>>> Specifically, clang options -fthinlto (used instead of -flto) will > >>>> cause clang to invoke the phase-1 emission of LLVM bitcode and > >>>> function summary/index on a compile step, and pass the appropriate > >>>> option to the gold plugin on a link step. The -thinlto option will be > >>>> added to the gold plugin and llvm-lto tool to launch the phase-2 thin > >>>> archive step. The -thinlto-be option will also be added to clang to > >>>> invoke it as a phase-3 parallel backend instance with a bitcode file > >>>> as input. > >>>> > >>>> > >>>> b. Thin-archive linking support in Gold plugin and llvm-lto > >>>> > >>>> > >>>> Under the new plugin option (see above), the plugin needs to perform > >>>> the phase-2 (thin archive) link which simply emits a combined > function > >>>> index from the linked modules, without actually performing the normal > >>>> link. Corresponding support should be added to the standalone llvm- > lto > >>>> tool to enable testing/debugging without involving the linker and > >>>> plugin. > >>>> > >>>> > >>>> c. ThinLTO backend support > >>>> > >>>> > >>>> Support for invoking a phase-3 backend invocation (including > >>>> importing) on a module should be added to the clang driver under the > >>>> new option. The main change under the option is to instantiate a > >>>> Linker object used to manage the process of linking imported > functions > >>>> into the module, efficient read of the combined function index, and > >>>> enable the ThinLTO import pass. > >>>> > >>>> > >>>> d. Function index/summary support > >>>> > >>>> > >>>> This includes infrastructure for writing and reading the function > >>>> index/summary section. As noted earlier this will be encoded in a > >>>> special section within the native object file for the module, > >>>> alongside the .llvmbc section containing the bitcode. 
The thin > archive > >>>> (combined function index) generated by phase-2 of ThinLTO simply > >>>> contains all of the function index/summary sections across the linked > >>>> modules, organized for efficient function lookup. As mentioned > earlier > >>>> when discussing the native object wrapper format, a separate design > >>>> document will be sent for this format. > >>>> > >>>> > >>>> Each function available for importing from the module contains an > >>>> entry in the module’s function index/summary section and in the > >>>> resulting combined function index. Each function entry contains that > >>>> function’s offset within the bitcode file, used to efficiently locate > >>>> and quickly import just that function (see below in 2e for more > >>>> details on the importing mechanics). The entry also contains summary > >>>> information (e.g. basic information determined during parsing such as > >>>> the number of instructions in the function), that will be used to > help > >>>> guide later import decisions. Because the contents of this section > >>>> will change frequently during ThinLTO tuning, it should also be > marked > >>>> with a version id for backwards compatibility or version checking. > >>>> > >>>> > >>>> e. ThinLTO importing support > >>>> > >>>> > >>>> Support for the mechanics of importing functions from other modules, > >>>> which can go in gradually as a set of patches since it will be off by > >>>> default (the ThinLTO pass itself discussed below in 2f). > >>>> > >>>> > >>>> Note that ThinLTO function importing is iterative, and we may import > >>>> from a number of modules in an interleaved fashion. For example, > >>>> assume we have hot call chains a()->b1()->c() and a()->b2()->d(), > >>>> where functions a(), b1()/b2(), c() and d() are from modules A, B, C > >>>> and D, respectively. 
When performing ThinLTO backend compilation of > >>>> module A, we may decide to import in the following order (based on > >>>> callsite and function summary info): > >>>> 1. B::b1() # exposes call to c() > >>>> 2. C::c() > >>>> 3. B::b2() # exposes call to d() > >>>> 4. D::d() > >>>> For this reason, ThinLTO importing is different than regular LTO > >>>> bitcode reading and linking, which reads and links in a module in its > >>>> entirety on a single pass through each module (notice in the above > >>>> example the imports of the two module B functions have an intervening > >>>> import from module C). As a result, for example, the existing support > >>>> for lazy metadata parsing that delays it until the first function is > >>>> materialized can’t be leveraged (metadata handling is discussed more > >>>> below in 2h). Therefore, the ThinLTO importing pass instantiates a > new > >>>> BitcodeReader and LTOModule object for each function we decide to > >>>> import, parsing only what is needed and linking in just that > function. > >>>> This is fast and efficient as found in the prototype results shown in > >>>> the linked EuroLLVM slides. > >>>> > >>>> > >>>> Separate patches can include: > >>>> > >>>> > >>>> * BitcodeReader changes to use function index to import/deserialize > >>>> single function of interest (small changes, leverages existing lazy > >>>> function streamer support). The declarations and other symbol table > >>>> info in the bitcode must be reloaded, but the bitcode parsing can > stop > >>>> once the first function body is hit. We simply set up an entry in the > >>>> lazy streamer’s DeferredFunctionInfo function index map from the > >>>> bitcode index that was saved in the ThinLTO function summary (and > >>>> therefore don’t need to build up this function index structure > through > >>>> repeated calls to RememberAndSkipFunctionBody via > >>>> FindFunctionInStream). 
> >>>> * Minor LTOModule changes to pass the ThinLTO function to import and > >>>> its index into bitcode reader (see 1a for discussion on LTOModule > >>>> use). > >>>> * Marking of imported functions. Most handling for ThinLTO imported > >>>> functions will simply rely on applying the appropriate linkage type. > >>>> But it is useful to know which functions were imported, both for > >>>> compiler debugging and and verification, and possibly to modify some > >>>> optimization heuristics along with the summary information. This can > >>>> be in-memory initially, but IR support may be required in order to > >>>> support streaming bitcode out and back in again after importing. > >>>> * ModuleLinker changes to do ThinLTO-specific symbol linking and > >>>> static promotion when necessary. The linkage type of imported > >>>> non-local functions and variables changes to > >>>> AvailableExternallyLinkage, for example. Statics must be promoted in > >>>> certain cases, and accordingly renamed in consistent ways. Read-write > >>>> or address-taken static variables must always be promoted. Other > >>>> discardable functions, i.e. link-once such as comdats, will be force > >>>> imported on reference by another imported function. We are working on > >>>> a separate design document describing these changes in more detail > >>>> with examples, as a more detailed discussion of these changes is > >>>> beyond the scope of this RFC. > >>>> * GlobalDCE changes to support removing imported non-local functions > >>>> that were not inlined and imported non-local variables, which are > >>>> marked AvailableExternallyLinkage (very small changes to existing > pass > >>>> logic). As discussed in the original RFC threads, currently GlobalDCE > >>>> does not remove referenced AvailableExternallyLinkage functions. > >>>> Instead, these are suppressed later during code generation. 
It isn’t > >>>> clear that these functions are useful past the first call to > >>>> GlobalDCE, which is after inlining, GlobalOpt and IPSCCP (so > >>>> presumably after inter procedural constant prop, etc). Patch with > >>>> these changes in testing as discussed in this thread: > >>>> http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-May/085807.html. > >>>> > >>>> > >>>> f. ThinLTO Import Driver SCC pass > >>>> > >>>> > >>>> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via > >>>> an SCC pass, enabled only under the -fthinlto-be option. The pass > >>>> includes utilizing the thin archive[2] (combined global function > >>>> index/summary), import decision heuristics, invocation of > >>>> LTOModule/ModuleLinker routines that perform the import, and any > >>>> necessary callgraph updates and verification. > >>>> > >>>> > >>>> g. Backend Driver > >>>> > >>>> > >>>> For a single node build, the gold plugin will initially exec the > >>>> backend processes directly, with the amount of parallelism controlled > >>>> via an option and/or env variable. It is also possible to leverage > >>>> existing single node build system task dispatching mechanisms such as > >>>> Unix Makefiles, Ninja, etc., where the plugin can simply write a > build > >>>> file and fork the parallel backend instances directly under an > >>>> appropriate option. We will also initially add support for our > >>>> distributed build system as described below under 3c. > >>>> > >>>> > >>>> h. Lazy Debug Metadata Linking > >>>> > >>>> > >>>> The prototype implementation included lazy importing of module-level > >>>> metadata during the ThinLTO pass finalization (i.e. after all > function > >>>> importing is complete). This actually applies to all module-level > >>>> metadata, not just debug, although it is the largest. This can be > >>>> added as a separate set of patches, and the detailed design will be > >>>> sent with those. 
Includes changes to BitcodeReader, ValueMapper, and > >>>> the ModuleLinker classes. As described in 2e, due to the > >>>> iterative/interleaved nature of ThinLTO importing, the bitcode > parsing > >>>> is structured differently than LTO where a single pass over each > >>>> module can be performed to parse and materialize all functions and > >>>> metadata. Therefore, the lazy metadata parsing support in > >>>> BitcodeReader, which parses all the metadata once the first function > >>>> is materialized, are not applicable. We may instantiate a > >>>> BitcodeReader multiple times for a module, if multiple functions are > >>>> eventually imported, and we need a way to suture up the metadata to > >>>> the functions imported by an earlier BitcodeReader instantiation. The > >>>> high level summary is that during the initial import we leave the > >>>> temporary metadata on the instructions that were imported, but save > >>>> the index used by the bitcode reader used to correlate with the > >>>> metadata when it is ready (i.e. the MDValuePtrs index), and skip the > >>>> metadata parsing. During the ThinLTO pass finalization we parse just > >>>> the metadata, and suture it up during metadata value mapping using > the > >>>> saved index. As mentioned earlier, this will be described in more > >>>> detail when the patches are ready. > >>>> > >>>> > >>>> 3. Stage 3: ThinLTO Tuning and Enhancements > >>>> --------------------------------------------------------------------- > ---- > >>>> > >>>> > >>>> This refers to the patches that are not required for ThinLTO to work, > >>>> but rather to improve compile time, memory, run-time performance and > >>>> usability. > >>>> > >>>> > >>>> a. Import Tuning > >>>> > >>>> > >>>> Tuning the import strategy will be an iterative process that will > >>>> continue to be refined over time. 
It involves several different types > >>>> of changes: adding support for recording additional metrics in the > >>>> function summary, such as profile data and optional heavier-weight > IPA > >>>> analyses, and tuning the import heuristics based on the summary and > >>>> callsite context. > >>>> > >>>> > >>>> b. Combined Function Index Pruning > >>>> > >>>> > >>>> The combined function index can be pruned of functions that are > >>>> unlikely to benefit from being imported. For example, during the > >>>> phase-2 thin archive plug step we can safely omit large and (with > >>>> profile data) cold functions, which are unlikely to benefit from > being > >>>> inlined. Additionally, all but one copy of comdat functions can be > >>>> suppressed. > >>>> > >>>> > >>>> c. Distributed Build System Integration > >>>> > >>>> > >>>> For a distributed build system such as Bazel (http://bazel.io/), the > >>>> gold plugin should write the parallel backend invocations into a > build > >>>> file, including the mapping from the IR file to the real object file > >>>> path, and exit. Additional work needs to be done in the distributed > >>>> build system itself to distribute and dispatch the parallel backend > >>>> jobs to the build cluster. > >>>> > >>>> > >>>> d. Dependence Tracking and Incremental Compiles > >>>> > >>>> > >>>> In order to support build systems that stage from local disks or > >>>> network storage, the plugin will optionally support computation of > >>>> dependent sets of IR files that each module may import from. This can > >>>> be computed from profile data, if it exists, or from the symbol table > >>>> and heuristics if not. These dependence sets also enable support for > >>>> incremental backend compiles. 
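The pruning rule from 3b above can be sketched roughly as follows. This is an illustration only, not code from the ThinLTO prototype: the FunctionEntry fields, the max_insts cutoff and the prune_combined_index name are all hypothetical, and the RFC does not say which size threshold or coldness test would actually be used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FunctionEntry:
    """Hypothetical combined-index entry; field names are assumptions."""
    name: str
    module: str
    inst_count: int                 # summary info recorded at compile time
    is_cold: Optional[bool] = None  # None means no profile data is available
    is_comdat: bool = False


def prune_combined_index(entries, max_insts=100):
    """Drop entries that are unlikely to benefit from being imported.

    max_insts is an arbitrary illustrative cutoff, not a value from the RFC.
    """
    kept = []
    seen_comdats = set()
    for entry in entries:
        if entry.inst_count > max_insts:
            continue  # large function: unlikely to be inlined
        if entry.is_cold:
            continue  # profile data says cold; None (no profile) is kept
        if entry.is_comdat:
            if entry.name in seen_comdats:
                continue  # keep only the first copy of a comdat function
            seen_comdats.add(entry.name)
        kept.append(entry)
    return kept
```

Entries without profile data are kept unless they exceed the size cutoff, which matches the "(with profile data)" qualifier in the text above.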
> >>>>
> >>>> ________________
> >>>> [1] The following compilers currently wrap intermediate LTO files in
> >>>> native object format: GCC fat and non-fat objects (with a custom
> >>>> symtab), Intel icc non-fat (IR-only) objects (with a full native
> >>>> symtab), HP’s aCC non-fat objects (with full native symtab), IBM xlC
> >>>> both fat and non-fat objects (with full native symtab).
> >>>> [2] The “thin archive” here (also referred to as a combined function
> >>>> index) has some similarities to the AR tool thin archive format, but
> >>>> is not exactly the same. Both contain the symtab and not the code, but
> >>>> the ThinLTO combined function index contains the summary sections as
> >>>> well.
> >>>>
> >>>> --
> >>>> Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
> >>>>
> >>>> _______________________________________________
> >>>> LLVM Developers mailing list
> >>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> >>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> Ok, I see. Does it help that there are LTOModule (lto_module_* in the
> C API) interfaces for checking if a file contains bitcode (regardless
> of whether it is straight-up or native-wrapped)?

Unfortunately, no, it won't help. We try to identify inputs on the
command line in this order: 1) object, 2) static library, 3) bitcode,
4) linker script. Our LTO implementation is split into a separate
process that is lazily loaded only when we are trying to identify an
input, don't recognize it as an object or static library, and suspect
it is a bitcode file.

> I don't know how hard
> in your linker it is to query these when deciding whether to treat the
> object file as bitcode or not, or how hard it is to pass the resulting
> object file along to the libLTO routines for handling (they
> automatically handle the native-wrapped object files so the linker
> shouldn't have to do anything special to read them).

It will be difficult with our current design, and not without
compromise. Either we change our identify routines to process the
section headers (which can be costly for COMDAT-heavy code or code built
with -ffunction-sections and -fdata-sections), or we change our ELF scan
routines to recognize ELF files that contain a .llvmbc section and
handle them differently.

Alternatively, would marking the native file header in a way that
identifies it as a bitcode wrapper be an option? That way we wouldn't
need to parse the section headers up front to decide how the input
should be treated. This is possibly something we can look at as a
private change.

> Specifically, in the C API these are the lto_module_is_object_file*
> variants, which will return true for either straight-up or
> native-wrapped bitcode. All of the mechanics of handling bitcode vs
> native object-wrapped bitcode are down in the IRObjectFile handling.
> So the LTOModule::isBitcode*/lto_module_is_object_file* will correctly
> identify native object-wrapped bitcode as bitcode. And the
> LTOModule::createFrom*/lto_module_create* routines correctly parse the
> native object-wrapped bitcode and return an LTOModule.

It's great to see that these API functions have been considered and
added to libLTO, and I'm sure they will help other tools.
Unfortunately, they won't help ours.
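To make the identification cost discussed above concrete, here is a minimal sketch of what "parse the sections and look for a .llvmbc section" involves. It is written in Python for brevity and is not how any real linker or libLTO does it. The function names and the returned labels are hypothetical, and only ELF64 images are handled. The point is that the file magic alone cannot separate a native object from a wrapped-bitcode object, so the section header table and its string table both have to be read.

```python
import struct

ELF_MAGIC = b"\x7fELF"
AR_MAGIC = b"!<arch>\n"            # static library (ar archive)
BITCODE_MAGIC = b"BC\xc0\xde"      # raw LLVM bitcode
WRAPPER_MAGIC = b"\xde\xc0\x17\x0b"  # Darwin-style bitcode wrapper (0x0B17C0DE)
EHDR64 = "16sHHIQQQIHHHHHH"        # ELF64 file header layout
SHDR64 = "IIQQQQIIQQ"              # ELF64 section header layout


def section_names(data):
    """Return the section names of an ELF64 image, or [] if unreadable."""
    if len(data) < 64 or data[:4] != ELF_MAGIC or data[4] != 2:
        return []                  # EI_CLASS 2 = ELFCLASS64
    endian = "<" if data[5] == 1 else ">"  # EI_DATA 1 = little endian
    try:
        ehdr = struct.unpack_from(endian + EHDR64, data, 0)
        shoff, shentsize, shnum, shstrndx = ehdr[6], ehdr[11], ehdr[12], ehdr[13]
        if shoff == 0 or shnum == 0 or shstrndx >= shnum:
            return []
        headers = [struct.unpack_from(endian + SHDR64, data, shoff + i * shentsize)
                   for i in range(shnum)]
    except struct.error:
        return []                  # truncated file
    # Index 4 is sh_offset, index 5 is sh_size; index 0 is sh_name.
    strtab_off, strtab_size = headers[shstrndx][4], headers[shstrndx][5]
    strtab = data[strtab_off:strtab_off + strtab_size]
    names = []
    for hdr in headers:
        end = strtab.find(b"\0", hdr[0])
        end = len(strtab) if end == -1 else end
        names.append(strtab[hdr[0]:end].decode("ascii", "replace"))
    return names


def classify_input(data):
    """Hypothetical input classifier; the labels are illustrative only."""
    if data[:4] == ELF_MAGIC:
        # Only the section scan can tell these two cases apart.
        return "wrapped-bitcode" if ".llvmbc" in section_names(data) else "native-object"
    if data[:8] == AR_MAGIC:
        return "static-library"
    if data[:4] in (BITCODE_MAGIC, WRAPPER_MAGIC):
        return "bitcode"
    return "unknown"
```

With -ffunction-sections and similar options the section header table can hold a very large number of entries, which is where the up-front cost comes from. A marker in the file header, as suggested above, would avoid the scan entirely.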
Actually Paul, that's a very good point and may work. If we could use the bitcode wrapper that wraps the native object wrapper that wraps the bitcode file, then that may actually work out of the box for us. The file format doesn't specify whether both wrappers can be used together, or in what order. Also, thinking about doubly wrapped files kind of makes my head hurt...

On Wed, Jun 3, 2015 at 5:52 PM, Robinson, Paul <Paul_Robinson at playstation.sony.com> wrote:> > >> -----Original Message----- >> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On >> Behalf Of Teresa Johnson >> Sent: Wednesday, June 03, 2015 7:02 AM >> To: Dave Bozier >> Cc: <llvmdev at cs.uiuc.edu> List >> Subject: Re: [LLVMdev] Updated RFC: ThinLTO Implementation Plan >> >> On Wed, Jun 3, 2015 at 4:19 AM, Dave Bozier <seifsta at gmail.com> wrote: >> > Hi Teresa, >> > >> > Thanks for providing this updated RFC. >> > >> >> For Sony's linker, are you using the gold plugin or libLTO interfaces? >> >> If the latter, I suppose some ThinLTO handling would have to be added >> >> to your linker (e.g. to invoke the LLVM hooks to write the stage-2 >> >> combined function map and either launch the backend processes in >> >> parallel or write out a make or other build file). The current support >> >> for reading native object wrapped bitcode is baked into IRObjectFile >> >> so presumably the Sony linker can handle these native object wrapped >> >> bitcode files if it uses libLTO. We would similarly embed the handling >> >> of the function index/summary behind an API that can handle either so >> >> it is similarly transparent to the linkers. Let me know if there would >> >> be additional issues that make wrapped bitcode more difficult in your >> >> case, or how we could make ThinLTO usage simpler for you in general. >> > We use the libLTO interfaces. >> >> Hi Dave, >> >> Thanks for the info. >> >> > >> > We use the libLTO interfaces, more specifically we use the C API >> > located in llvm-c\lto.h.
>> > >> > Our linker won't support native object wrapped bitcode files as our >> > LTO is it currently stands. Right now, it will be recognized as an >> > object file and won't get anywhere near the libLTO libraries. We'd >> > need to teach our linker to recognize and differentiate native object >> > wrapped bitcode files and regular native object files. This isn't >> > straight forward as we cannot distinguish them just by looking at the >> > file header alone, we would need to parse the sections and look for a >> > .llvmbc section. We then need to add special handling of these native >> > object wrappers. >> >> Ok, I see. Does it help that there are LTOModule (lto_module_* in the >> C API) interfaces for checking if a file contains bitcode (regardless >> of whether it is straight-up or native-wrapped)? I don't know how hard >> in your linker it is to query these when deciding whether to treat the >> object file as bitcode or not, or how hard it is to pass the resulting >> object file along to the libLTO routines for handling (they >> automatically handle the native-wrapped object files so the linker >> shouldn't have to do anything special to read them). > > One twist is that we use the Darwin-style wrapper around our bitcode files > so that we have a place to hang a bitcode version number, which we also > want to check. Without reopening the debate about why we do that, we do > that, and I fully expect the libLTO API to silently ignore the wrapper that > we are depending on. I suppose we could add a new libLTO API that verifies > the bitcode wrapper but it would be yet another private change to maintain, > rather than just having the linker check it directly. > --paulr > >> >> Specifically, in the C API these are the lto_module_is_object_file* >> variants, which will return true for either straight-up or >> native-wrapped bitcode. All of the mechanics of handling bitcode vs >> native object-wrapped bitcode are down in the IRObjectFile handling. 
>> So the LTOModule:isBitcode*/lto_module_is_object_file* will correctly >> identify native object-wrapped bitcode as bitcode. And the >> LTOModule::createFrom*/lto_module_create* routines correctly parse the >> native object-wrapped bitcode and return an LTOmodule. >> >> As a result, the llvm-lto tool that also uses libLTO interfaces didn't >> require any changes when the native-wrapped reading support went in >> (r218078), and is able to handle native-wrapped bitcode out of the >> box. >> >> > >> > Handling the function index/summary behind an API sounds like a good >> idea. >> >> I am going to work on fleshing out this part next so that the actual >> format of the files is hidden from clients. >> >> Thanks, >> Teresa >> >> > >> > On Fri, May 29, 2015 at 4:01 PM, Teresa Johnson <tejohnson at google.com> >> wrote: >> >> On Fri, May 29, 2015 at 6:56 AM, Alex Rosenberg <alexr at leftfield.org> >> wrote: >> >>> My earlier statement about wrapping things in a native object file >> held in that it is controversial. It appears to be still central to your >> design. >> >>> >> >>> It may help to look at the problem from a different viewpoint: LLVM is >> not a compiler. It is a framework that can be used to make compiler-like >> tools. >> >>> >> >>> From that view, it no longer makes sense to discuss "the plugin," or >> gold, or $AR, because there isn't just one of any of those things. ld64 >> isn't the only outlier linker to consider. We have our own linker at Sony, >> for example. From this perspective, then it makes more sense to consider >> replacing the binary utilities with ones that support bitcode, because >> from a user-perspective, all of the linkers already transparently support >> bitcode directly today, as do ar, nm, etc. This has been necessary for the >> regular LTO process. 
>> >> >> >> Hi Alex, >> >> >> >> It's true that the LLVM versions of these tools support bitcode >> >> transparently, but not all build systems use LLVM versions of these >> >> tools, particularly build systems that support a variety of compilers, >> >> or legacy build systems. And not all build systems have the plugin or >> >> currently pass it to the native tools that can take a plugin for >> >> handling bitcode. In those cases the bitcode support is not >> >> transparently available, and our aim is to reduce the friction as much >> >> as possible. And not all use LTO currently (I know we don't due to the >> >> scalability issues we're trying to address with this design), and in >> >> those cases the migration to bitcode-aware tools and plugins was not >> >> previously required. >> >> >> >> For Sony's linker, are you using the gold plugin or libLTO interfaces? >> >> If the latter, I suppose some ThinLTO handling would have to be added >> >> to your linker (e.g. to invoke the LLVM hooks to write the stage-2 >> >> combined function map and either launch the backend processes in >> >> parallel or write out a make or other build file). The current support >> >> for reading native object wrapped bitcode is baked into IRObjectFile >> >> so presumably the Sony linker can handle these native object wrapped >> >> bitcode files if it uses libLTO. We would similarly embed the handling >> >> of the function index/summary behind an API that can handle either so >> >> it is similarly transparent to the linkers. Let me know if there would >> >> be additional issues that make wrapped bitcode more difficult in your >> >> case, or how we could make ThinLTO usage simpler for you in general. >> >> >> >>> >> >>> The only tool in the list of tools you mentioned that do not support >> bitcode directly is objcopy, and that's because nobody has yet written an >> LLVM-project implementation of it. 
Personally, I'd much rather you focus >> on making ThinLTO work by extending bitcode as needed, and we work as a >> community toward replacing objcopy with an LLVM-native one. It's a big >> missing piece of the LLVM project today and could be so much better if we >> could use it to replace Apple's lipo and possibly other extant object file >> modification tools. (Has anyone surveyed this area?) >> >>> >> >>> That older toolchains have tried to slip non-object file data through >> the binary utilities isn't really proof that this is a good choice. It >> might simply reflect the realities of those engineering teams. I wasn't at >> Sun for this, but DTrace needed a linker feature that apparently the Sun >> linker team was unwilling or unable to provide, so dtrace(1) gained the >> ability to modify ELF files directly as needed. That doesn't prove that >> DTrace's USDT feature shouldn't have been implemented in the linker (as >> ld64 does directly for Apple), does it? >> >> >> >> I'd argue that the realities being addressed by using native object >> >> format in those cases still exist. >> >> >> >>> >> >>> If in the end using native object-wrapped bitcode is the best >> solution, so be it. However, I think it is largely orthogonal to ThinLTO's >> needs for transporting symtab data alongside the existing bitcode format. >> >> >> >> That's certainly true, ThinLTO can be implemented using either format, >> >> and bitcode only support can certainly be implemented. It is a matter >> >> of prioritizing which format to implement first. I had added some >> >> description to the updated RFC on how the function index/summary can >> >> be represented, etc in bitcode. Prioritizing the native object format >> >> doesn't make it easier to implement ThinLTO, but should make it easier >> >> to deploy. >> >> >> >> Thanks! 
>> >> Teresa >> >> >> >>> >> >>> Alex >> >>> >> >>>> On May 28, 2015, at 2:10 PM, Teresa Johnson <tejohnson at google.com> >> wrote: >> >>>> >> >>>> As promised, here is an new version of the ThinLTO RFC, updated based >> >>>> on some of the comments, questions and feedback from the first RFC. >> >>>> Hopefully we have addressed many of these, and as noted below, will >> >>>> fork some of the detailed discussion on particular aspects into >> >>>> separate design doc threads. Please send any additional feedback and >> >>>> questions on the overall design. >> >>>> Thanks! >> >>>> Teresa >> >>>> >> >>>> >> >>>> Updated RFC to discuss plans for implementing ThinLTO upstream, >> >>>> reflecting feedback and discussion from initial RFC >> >>>> (http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-May/085557.html). As >> >>>> discussed in the earlier thread and below, more detailed design >> >>>> documents for several pieces (native object format, linkage type >> >>>> changes and static promotions, etc) are in progress and will be sent >> >>>> separately. This RFC covers the overall design and the breakdown of >> >>>> work at a higher level. >> >>>> >> >>>> >> >>>> Background on ThinLTO can be found in slides from EuroLLVM 2015: >> >>>> >> https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0 >> >>>> As described in the talk, we have a prototype implementation, and >> >>>> would like to start staging patches upstream. This RFC describes a >> >>>> breakdown of the major pieces. We would like to commit upstream >> >>>> gradually in several stages, with all functionality off by default. >> >>>> The core ThinLTO importing support and tuning will require frequent >> >>>> change and iteration during testing and tuning, and for that part we >> >>>> would like to commit rapidly (off by default). See the proposed >> staged >> >>>> implementation described in the Implementation Plan section. 
>> >>>>
>> >>>> ThinLTO Overview
>> >>>> ================
>> >>>>
>> >>>> See the talk slides linked above for more details. The following is a
>> >>>> high-level overview of the motivation.
>> >>>>
>> >>>> Cross Module Optimization (CMO) is an effective means for improving
>> >>>> runtime performance, by extending the scope of optimizations across
>> >>>> source module boundaries. Without CMO, the compiler is limited to
>> >>>> optimizing within the scope of single source modules. Two solutions
>> >>>> for enabling CMO are Link-Time Optimization (LTO), which is currently
>> >>>> supported in LLVM and GCC, and Lightweight-Interprocedural
>> >>>> Optimization (LIPO). However, each of these solutions has limitations
>> >>>> that prevent it from being enabled by default. ThinLTO is a new
>> >>>> approach that attempts to address these limitations, with a goal of
>> >>>> being enabled more broadly. ThinLTO is designed with many of the same
>> >>>> principles as LIPO, and therefore its advantages, without any of its
>> >>>> inherent weaknesses. Unlike in LIPO where the module group decision is
>> >>>> made at profile training runtime, ThinLTO makes the decision at
>> >>>> compile time, but in a lazy mode that facilitates large scale
>> >>>> parallelism. LTO implementations all contain a serial IPA/IPO step
>> >>>> that is both memory intensive and slow, limiting usability on both
>> >>>> smaller workstations and huge applications. In contrast, the ThinLTO
>> >>>> serial linker plugin phase is designed to be razor thin and blazingly
>> >>>> fast. By default this step only does minimal preparation work to
>> >>>> enable the parallel lazy importing performed later. ThinLTO aims to be
>> >>>> scalable like a regular O2 build, enabling CMO on machines without
>> >>>> large memory configurations, while also integrating well with
>> >>>> distributed build systems.
Results from early prototyping on SPEC >> >>>> cpu2006 C++ benchmarks are in line with expectations that ThinLTO can >> >>>> scale like O2 while enabling much of the CMO performed during a full >> >>>> LTO build. >> >>>> >> >>>> >> >>>> A ThinLTO build is divided into 3 phases, which are referred to in >> the >> >>>> following implementation plan: >> >>>> 1. phase-1: IR and Function Summary Generation (-c compile) >> >>>> 2. phase-2: Thin Linker Plugin Layer (thin archive linker step) >> >>>> 3. phase-3: Parallel Backend with Demand-Driven Importing >> >>>> >> >>>> >> >>>> Implementation Plan >> >>>> ===================>> >>>> >> >>>> >> >>>> This section gives a high-level breakdown of the ThinLTO support that >> >>>> will be added, in roughly the order that the patches would be staged. >> >>>> The patches are divided into three stages. The first stage contains a >> >>>> minimal amount of preparation work that is not ThinLTO-specific. The >> >>>> second stage contains most of the infrastructure for ThinLTO, which >> >>>> will be off by default. The third stage includes >> >>>> enhancements/improvements/tunings that can be performed after the >> main >> >>>> ThinLTO infrastructure is in. >> >>>> >> >>>> >> >>>> The second and third implementation stages will initially be very >> >>>> volatile, requiring a lot of iterations and tuning with large apps to >> >>>> get stabilized. Therefore it will be important to do fast commits for >> >>>> these implementation stages. >> >>>> >> >>>> >> >>>> 1. Stage 1: Preparation >> >>>> ------------------------------------ >> >>>> >> >>>> >> >>>> The first planned sets of patches are enablers for ThinLTO work: >> >>>> >> >>>> >> >>>> a. LTO directory structure >> >>>> >> >>>> >> >>>> Restructure the LTO directory to remove circular dependence when >> >>>> ThinLTO pass added. 
Because ThinLTO is being implemented as a SCC >> pass >> >>>> within Transforms/IPO, and leverages the LTOModule class for linking >> >>>> in functions from modules, IPO then requires the LTO library. This >> >>>> creates a circular dependence between LTO and IPO. To break that, we >> >>>> need to split the lib/LTO directory/library into lib/LTO/CodeGen and >> >>>> lib/LTO/Module, containing LTOCodeGenerator and LTOModule, >> >>>> respectively. Only LTOCodeGenerator has a dependence on IPO, removing >> >>>> the circular dependence. >> >>>> >> >>>> >> >>>> Note that libLTO and llvm-lto use LTOModule/LTOCodeGenerator, whereas >> >>>> the gold plugin uses lib/Object/IRObject and lib/Linker directly. The >> >>>> use of LTOModule in the ThinLTO pass is a convenience, but could be >> >>>> avoided by using the IRObject/Linker methods directly if that is >> >>>> preferred. >> >>>> >> >>>> >> >>>> b. Native object wrapper generation support >> >>>> >> >>>> >> >>>> Implement native-object wrapped bitcode writer. The main goal is to >> >>>> more easily interact with existing native tools such as $AR, $NM, >> “$LD >> >>>> -r”, $OBJCOPY, and $RANLIB, without requiring the build system to >> find >> >>>> and pass the plugin as an option. We plan to emit the phase-1 bitcode >> >>>> wrapped in native object format via the .llvmbc section, along with a >> >>>> symbol table. We will implement ELF first, but subsequently extend >> >>>> support to COFF and Mach-O. Additionally, we also want to avoid doing >> >>>> partial LTO/ThinLTO across files linked with “$LD -r” (i.e. the >> >>>> resulting object file should still contain native object-wrapped >> >>>> bitcode to enable ThinLTO at the full link step). I will send a >> >>>> separate design document for these changes, including the format of >> >>>> the symtab and function index/summary section, but the following is a >> >>>> high-level motivation and overview. 
>> >>>> >> >>>> >> >>>> Note that support for ThinLTO using bitcode can be added as a >> >>>> follow-on under an option, so that bitcode-aware tools do not need to >> >>>> use the wrapper. Under the bitcode-only option, the symbol table will >> >>>> be replaced by the bitcode form of the function index and summary >> >>>> section, which can be encoded as a new bitcode block type. Changes >> >>>> should be made to the gold plugin to avoid partial link of bitcode >> >>>> files under “$LD -r” (emitting bitcode rather than compiling all the >> >>>> way down to native code, which is how ld64 behaves on Darwin as per >> >>>> dexonsmith). >> >>>> >> >>>> >> >>>> Advantages of using native object format: >> >>>> * Out of the box interoperability with existing native build tools >> >>>> ($AR, $NM, “$LD -r”, $OBJCOPY, and $RANLIB) which may not currently >> >>>> know how to locate/pass the appropriate plugin. >> >>>> * There is precedence in using this format: other compilers also wrap >> >>>> intermediate LTO files (probably related to the above advantage)[1]. >> >>>> * Tools that modify symbol linkage and visibility (e.g. $OBJCOPY and >> >>>> “$LD -r”) can mark the change in the symbol table without needing to >> >>>> parse/change/encode bitcode. The change can be propagated to bitcode >> >>>> by the ThinLTO backend. >> >>>> * Some tools only need to read/write the symtab and can avoid >> >>>> parsing/encoding bitcode (e.g. $NM, $OBJCOPY). >> >>>> * The second phase of ThinLTO does not need to parse the bitcode when >> >>>> creating the combined function index. >> >>>> >> >>>> >> >>>> Disadvantages of using native object format: >> >>>> * Unnecessary when using plugins with plugin-aware native tools, or >> >>>> LLVM’s custom tools. >> >>>> * Slightly increase disk storage and I/O from symtab. However, with >> >>>> our design the symtab is leveraged to hold function indexing info >> >>>> required for ThinLTO. 
The I/O for some build tools and build steps >> can >> >>>> actually be reduced as there is no need to read the bitcode, as >> >>>> described above. >> >>>> >> >>>> >> >>>> Support was added to LLVM for reading native object-wrapped bitcode >> >>>> (http://reviews.llvm.org/rL218078), but there does not yet exist >> >>>> support in LLVM/Clang for emitting bitcode wrapped in native object >> >>>> format. I plan to add support for optionally generating bitcode in an >> >>>> native object file containing a single .llvmbc section holding the >> >>>> bitcode. Specifically, the patch would add new options >> >>>> “emit-llvm-native-object” (object file) and corresponding >> >>>> “emit-llvm-native-assembly” (textual assembly code equivalent). >> >>>> Eventually these would be automatically triggered under “-fthinlto - >> c” >> >>>> and “-fthinlto -S”, respectively. >> >>>> >> >>>> >> >>>> Additionally, a symbol table will be generated in the native object >> >>>> file, holding the function symbols within the bitcode. This >> >>>> facilitates handling archives of the native object-wrapped bitcode >> >>>> created with $AR, since the archive will have a symbol table as well. >> >>>> The archive symbol table enables gold to extract and pass to the >> >>>> plugin the constituent native object-wrapped bitcode files. To >> support >> >>>> the concatenated llvmbc section generated by “$LD -r”, some handling >> >>>> needs to be added to gold and to the backend driver to process each >> >>>> original module’s bitcode. >> >>>> >> >>>> >> >>>> The function index/summary will later be added as a special native >> >>>> object section alongside the .llvmbc sections. The offset and size of >> >>>> the corresponding function summary can be placed in the associated >> >>>> symtab entry. As noted above, a separate design document will be sent >> >>>> for the native object format changes. >> >>>> >> >>>> >> >>>> 2. 
Stage 2: ThinLTO Infrastructure >> >>>> ------------------------------------------------------ >> >>>> >> >>>> >> >>>> The next set of patches adds the base implementation of the ThinLTO >> >>>> infrastructure, specifically those required to make ThinLTO >> functional >> >>>> and generate correct but not necessarily high-performing binaries. >> >>>> >> >>>> >> >>>> a. Clang/LLVM/gold linker options >> >>>> >> >>>> >> >>>> An early set of clang/llvm patches is needed to provide options to >> >>>> enable ThinLTO (off by default), so that the rest of the >> >>>> implementation can be disabled by default as it is added. >> >>>> Specifically, clang options -fthinlto (used instead of -flto) will >> >>>> cause clang to invoke the phase-1 emission of LLVM bitcode and >> >>>> function summary/index on a compile step, and pass the appropriate >> >>>> option to the gold plugin on a link step. The -thinlto option will be >> >>>> added to the gold plugin and llvm-lto tool to launch the phase-2 thin >> >>>> archive step. The -thinlto-be option will also be added to clang to >> >>>> invoke it as a phase-3 parallel backend instance with a bitcode file >> >>>> as input. >> >>>> >> >>>> >> >>>> b. Thin-archive linking support in Gold plugin and llvm-lto >> >>>> >> >>>> >> >>>> Under the new plugin option (see above), the plugin needs to perform >> >>>> the phase-2 (thin archive) link which simply emits a combined >> function >> >>>> index from the linked modules, without actually performing the normal >> >>>> link. Corresponding support should be added to the standalone llvm- >> lto >> >>>> tool to enable testing/debugging without involving the linker and >> >>>> plugin. >> >>>> >> >>>> >> >>>> c. ThinLTO backend support >> >>>> >> >>>> >> >>>> Support for invoking a phase-3 backend invocation (including >> >>>> importing) on a module should be added to the clang driver under the >> >>>> new option. 
The main change under the option is to instantiate a >> >>>> Linker object used to manage the process of linking imported >> functions >> >>>> into the module, efficient read of the combined function index, and >> >>>> enable the ThinLTO import pass. >> >>>> >> >>>> >> >>>> d. Function index/summary support >> >>>> >> >>>> >> >>>> This includes infrastructure for writing and reading the function >> >>>> index/summary section. As noted earlier this will be encoded in a >> >>>> special section within the native object file for the module, >> >>>> alongside the .llvmbc section containing the bitcode. The thin >> archive >> >>>> (combined function index) generated by phase-2 of ThinLTO simply >> >>>> contains all of the function index/summary sections across the linked >> >>>> modules, organized for efficient function lookup. As mentioned >> earlier >> >>>> when discussing the native object wrapper format, a separate design >> >>>> document will be sent for this format. >> >>>> >> >>>> >> >>>> Each function available for importing from the module contains an >> >>>> entry in the module’s function index/summary section and in the >> >>>> resulting combined function index. Each function entry contains that >> >>>> function’s offset within the bitcode file, used to efficiently locate >> >>>> and quickly import just that function (see below in 2e for more >> >>>> details on the importing mechanics). The entry also contains summary >> >>>> information (e.g. basic information determined during parsing such as >> >>>> the number of instructions in the function), that will be used to >> help >> >>>> guide later import decisions. Because the contents of this section >> >>>> will change frequently during ThinLTO tuning, it should also be >> marked >> >>>> with a version id for backwards compatibility or version checking. >> >>>> >> >>>> >> >>>> e. 
ThinLTO importing support >> >>>> >> >>>> >> >>>> Support for the mechanics of importing functions from other modules, >> >>>> which can go in gradually as a set of patches since it will be off by >> >>>> default (the ThinLTO pass itself discussed below in 2f). >> >>>> >> >>>> >> >>>> Note that ThinLTO function importing is iterative, and we may import >> >>>> from a number of modules in an interleaved fashion. For example, >> >>>> assume we have hot call chains a()->b1()->c() and a()->b2()->d(), >> >>>> where functions a(), b1()/b2(), c() and d() are from modules A, B, C >> >>>> and D, respectively. When performing ThinLTO backend compilation of >> >>>> module A, we may decide to import in the following order (based on >> >>>> callsite and function summary info): >> >>>> 1. B::b1() # exposes call to c() >> >>>> 2. C::c() >> >>>> 3. B::b2() # exposes call to d() >> >>>> 4. D::d() >> >>>> For this reason, ThinLTO importing is different than regular LTO >> >>>> bitcode reading and linking, which reads and links in a module in its >> >>>> entirety on a single pass through each module (notice in the above >> >>>> example the imports of the two module B functions have an intervening >> >>>> import from module C). As a result, for example, the existing support >> >>>> for lazy metadata parsing that delays it until the first function is >> >>>> materialized can’t be leveraged (metadata handling is discussed more >> >>>> below in 2h). Therefore, the ThinLTO importing pass instantiates a >> new >> >>>> BitcodeReader and LTOModule object for each function we decide to >> >>>> import, parsing only what is needed and linking in just that >> function. >> >>>> This is fast and efficient as found in the prototype results shown in >> >>>> the linked EuroLLVM slides. 
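The interleaved import order in the example above can be reproduced with a small sketch. This is only an illustration in Python under simplifying assumptions: the call graph is given up front as a dictionary, every cross-module callee is imported (the real pass would apply import heuristics based on callsite and summary info), and the import_order name and the (module, function) tuple representation are hypothetical.

```python
def import_order(call_graph, root):
    """Return cross-module functions in the order they would be imported.

    call_graph maps a (module, function) pair to the pairs it calls.
    root is the (module, function) pair being compiled.
    """
    home_module = root[0]
    order = []
    seen = {root}

    def visit(caller):
        for callee in call_graph.get(caller, []):
            if callee in seen:
                continue
            seen.add(callee)
            if callee[0] != home_module:
                order.append(callee)  # import from another module
            visit(callee)             # the imported body exposes new calls

    visit(root)
    return order


# Hot call chains a()->b1()->c() and a()->b2()->d() from the example.
graph = {
    ("A", "a"): [("B", "b1"), ("B", "b2")],
    ("B", "b1"): [("C", "c")],
    ("B", "b2"): [("D", "d")],
}
for module, function in import_order(graph, ("A", "a")):
    print(f"{module}::{function}()")
```

The two imports from module B are separated by an import from module C, which is why a single pass over each module is not enough and a fresh reader per imported function is proposed above.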
>> >>>>
>> >>>> Separate patches can include:
>> >>>>
>> >>>> * BitcodeReader changes to use function index to import/deserialize
>> >>>> single function of interest (small changes, leverages existing lazy
>> >>>> function streamer support). The declarations and other symbol table
>> >>>> info in the bitcode must be reloaded, but the bitcode parsing can stop
>> >>>> once the first function body is hit. We simply set up an entry in the
>> >>>> lazy streamer’s DeferredFunctionInfo function index map from the
>> >>>> bitcode index that was saved in the ThinLTO function summary (and
>> >>>> therefore don’t need to build up this function index structure through
>> >>>> repeated calls to RememberAndSkipFunctionBody via
>> >>>> FindFunctionInStream).
>> >>>> * Minor LTOModule changes to pass the ThinLTO function to import and
>> >>>> its index into bitcode reader (see 1a for discussion on LTOModule
>> >>>> use).
>> >>>> * Marking of imported functions. Most handling for ThinLTO imported
>> >>>> functions will simply rely on applying the appropriate linkage type.
>> >>>> But it is useful to know which functions were imported, both for
>> >>>> compiler debugging and verification, and possibly to modify some
>> >>>> optimization heuristics along with the summary information. This can
>> >>>> be in-memory initially, but IR support may be required in order to
>> >>>> support streaming bitcode out and back in again after importing.
>> >>>> * ModuleLinker changes to do ThinLTO-specific symbol linking and
>> >>>> static promotion when necessary. The linkage type of imported
>> >>>> non-local functions and variables changes to
>> >>>> AvailableExternallyLinkage, for example. Statics must be promoted in
>> >>>> certain cases, and accordingly renamed in consistent ways. Read-write
>> >>>> or address-taken static variables must always be promoted. Other
>> >>>> discardable functions, i.e. link-once such as comdats, will be force
>> >>>> imported on reference by another imported function. We are working on
>> >>>> a separate design document describing these changes in more detail
>> >>>> with examples, as a more detailed discussion of these changes is
>> >>>> beyond the scope of this RFC.
>> >>>> * GlobalDCE changes to support removing imported non-local functions
>> >>>> that were not inlined and imported non-local variables, which are
>> >>>> marked AvailableExternallyLinkage (very small changes to existing pass
>> >>>> logic). As discussed in the original RFC threads, currently GlobalDCE
>> >>>> does not remove referenced AvailableExternallyLinkage functions.
>> >>>> Instead, these are suppressed later during code generation. It isn’t
>> >>>> clear that these functions are useful past the first call to
>> >>>> GlobalDCE, which is after inlining, GlobalOpt and IPSCCP (so
>> >>>> presumably after interprocedural constant prop, etc). A patch with
>> >>>> these changes is in testing, as discussed in this thread:
>> >>>> http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-May/085807.html
>> >>>>
>> >>>> f. ThinLTO Import Driver SCC pass
>> >>>>
>> >>>> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
>> >>>> an SCC pass, enabled only under the -fthinlto-be option. The pass
>> >>>> includes utilizing the thin archive[2] (combined global function
>> >>>> index/summary), import decision heuristics, invocation of
>> >>>> LTOModule/ModuleLinker routines that perform the import, and any
>> >>>> necessary callgraph updates and verification.
>> >>>>
>> >>>> g. Backend Driver
>> >>>>
>> >>>> For a single node build, the gold plugin will initially exec the
>> >>>> backend processes directly, with the amount of parallelism controlled
>> >>>> via an option and/or env variable.
>> >>>> It is also possible to leverage
>> >>>> existing single node build system task dispatching mechanisms such as
>> >>>> Unix Makefiles, Ninja, etc., where the plugin can simply write a build
>> >>>> file and fork the parallel backend instances directly under an
>> >>>> appropriate option. We will also initially add support for our
>> >>>> distributed build system as described below under 3c.
>> >>>>
>> >>>> h. Lazy Debug Metadata Linking
>> >>>>
>> >>>> The prototype implementation included lazy importing of module-level
>> >>>> metadata during the ThinLTO pass finalization (i.e. after all function
>> >>>> importing is complete). This actually applies to all module-level
>> >>>> metadata, not just debug metadata, although debug metadata is the
>> >>>> largest. This can be added as a separate set of patches, and the
>> >>>> detailed design will be sent with those. It includes changes to the
>> >>>> BitcodeReader, ValueMapper, and ModuleLinker classes. As described in
>> >>>> 2e, due to the iterative/interleaved nature of ThinLTO importing, the
>> >>>> bitcode parsing is structured differently from LTO, where a single
>> >>>> pass over each module can be performed to parse and materialize all
>> >>>> functions and metadata. Therefore, the lazy metadata parsing support
>> >>>> in BitcodeReader, which parses all the metadata once the first
>> >>>> function is materialized, is not applicable. We may instantiate a
>> >>>> BitcodeReader multiple times for a module, if multiple functions are
>> >>>> eventually imported, and we need a way to suture up the metadata to
>> >>>> the functions imported by an earlier BitcodeReader instantiation. The
>> >>>> high level summary is that during the initial import we leave the
>> >>>> temporary metadata on the instructions that were imported, but save
>> >>>> the index that the bitcode reader uses to correlate them with the
>> >>>> metadata when it is ready (i.e. the MDValuePtrs index), and skip the
>> >>>> metadata parsing.
>> >>>> During the ThinLTO pass finalization we parse just
>> >>>> the metadata, and suture it up during metadata value mapping using the
>> >>>> saved index. As mentioned earlier, this will be described in more
>> >>>> detail when the patches are ready.
>> >>>>
>> >>>> 3. Stage 3: ThinLTO Tuning and Enhancements
>> >>>> -------------------------------------------------------------------------
>> >>>>
>> >>>> This refers to the patches that are not required for ThinLTO to work,
>> >>>> but that improve compile time, memory, run-time performance and
>> >>>> usability.
>> >>>>
>> >>>> a. Import Tuning
>> >>>>
>> >>>> Tuning the import strategy will be an iterative process that will
>> >>>> continue to be refined over time. It involves several different types
>> >>>> of changes: adding support for recording additional metrics in the
>> >>>> function summary, such as profile data and optional heavier-weight IPA
>> >>>> analyses, and tuning the import heuristics based on the summary and
>> >>>> callsite context.
>> >>>>
>> >>>> b. Combined Function Index Pruning
>> >>>>
>> >>>> The combined function index can be pruned of functions that are
>> >>>> unlikely to benefit from being imported. For example, during the
>> >>>> phase-2 thin archive plugin step we can safely omit large and (with
>> >>>> profile data) cold functions, which are unlikely to benefit from being
>> >>>> inlined. Additionally, all but one copy of comdat functions can be
>> >>>> suppressed.
>> >>>>
>> >>>> c. Distributed Build System Integration
>> >>>>
>> >>>> For a distributed build system such as Bazel (http://bazel.io/), the
>> >>>> gold plugin should write the parallel backend invocations into a build
>> >>>> file, including the mapping from the IR file to the real object file
>> >>>> path, and exit.
>> >>>> Additional work needs to be done in the distributed
>> >>>> build system itself to distribute and dispatch the parallel backend
>> >>>> jobs to the build cluster.
>> >>>>
>> >>>> d. Dependence Tracking and Incremental Compiles
>> >>>>
>> >>>> In order to support build systems that stage from local disks or
>> >>>> network storage, the plugin will optionally support computation of
>> >>>> dependent sets of IR files that each module may import from. This can
>> >>>> be computed from profile data, if it exists, or from the symbol table
>> >>>> and heuristics if not. These dependence sets also enable support for
>> >>>> incremental backend compiles.
>> >>>>
>> >>>> ________________
>> >>>> [1] The following compilers currently wrap intermediate LTO files in
>> >>>> native object format: GCC fat and non-fat objects (with a custom
>> >>>> symtab), Intel icc non-fat (IR-only) objects (with a full native
>> >>>> symtab), HP’s aCC non-fat objects (with full native symtab), IBM xlC
>> >>>> both fat and non-fat objects (with full native symtab).
>> >>>> [2] The “thin archive” here (also referred to as a combined function
>> >>>> index) has some similarities to the AR tool thin archive format, but
>> >>>> is not exactly the same. Both contain the symtab and not the code, but
>> >>>> the ThinLTO combined function index contains the summary sections as
>> >>>> well.
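The pruning idea in 3b can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the planned implementation: the Summary fields and the MAX_IMPORTABLE_INSTS threshold are hypothetical, "large and (with profile data) cold" is read here as "drop a function if it is large, or if profile data marks it cold", and "all but one copy of comdat functions" is modeled by keeping the first copy seen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    name: str
    module: str
    inst_count: int   # number of instructions, from the function summary
    is_cold: bool     # True only when profile data marks the function cold
    is_comdat: bool   # True for link-once/comdat functions


MAX_IMPORTABLE_INSTS = 100  # hypothetical "large function" threshold


def prune_index(entries):
    """Return the entries worth keeping in the combined function index."""
    kept = []
    seen_comdats = set()
    for entry in entries:
        if entry.inst_count > MAX_IMPORTABLE_INSTS:
            continue  # large: unlikely to be inlined after import
        if entry.is_cold:
            continue  # cold per profile data: little benefit from importing
        if entry.is_comdat:
            if entry.name in seen_comdats:
                continue  # suppress every copy after the first
            seen_comdats.add(entry.name)
        kept.append(entry)
    return kept


if __name__ == "__main__":
    index = [
        Summary("hot_small", "A", 10, False, False),
        Summary("huge", "A", 500, False, False),
        Summary("cold_small", "B", 5, True, False),
        Summary("inline_fn", "A", 8, False, True),
        Summary("inline_fn", "B", 8, False, True),
    ]
    for entry in prune_index(index):
        print(entry.module, entry.name)
```

Which comdat copy survives is arbitrary in this sketch; a real implementation might prefer a particular copy for reasons the RFC does not go into.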
>> >>>>
>> >>>> --
>> >>>> Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
>> >>>>
>> >>>> _______________________________________________
>> >>>> LLVM Developers mailing list
>> >>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> >>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Xinliang David Li
2015-Jun-03 17:11 UTC
[LLVMdev] Updated RFC: ThinLTO Implementation Plan
On Wed, Jun 3, 2015 at 9:54 AM, Dave Bozier <seifsta at gmail.com> wrote:

> > Ok, I see. Does it help that there are LTOModule (lto_module_* in the
> > C API) interfaces for checking if a file contains bitcode (regardless
> > of whether it is straight-up or native-wrapped)?
>
> Unfortunately no it won't help. We try to identify inputs on the
> command line in the order 1) Object 2) static library 3) Bitcode 4)
> Linker script. Our LTO implementation is split into a separate process
> that is lazily loaded only in the case that we are trying to identify
> an input and don't recognize it as an object or static library and
> suspect it is a bitcode file.

This sounds a little fragile and inflexible. How is the command line formed?

> > I don't know how hard
> > in your linker it is to query these when deciding whether to treat the
> > object file as bitcode or not, or how hard it is to pass the resulting
> > object file along to the libLTO routines for handling (they
> > automatically handle the native-wrapped object files so the linker
> > shouldn't have to do anything special to read them).
>
> It will be difficult with our current design and without compromise.
> Either we change our Identify routines to process the section headers
> (can be costly for COMDAT heavy code or code built with
> -ffunction-sections, -fdata-sections). Or we change our ELF scan
> routines to specially identify and handle ELF with llvmbc differently.

At least for ELF, there is an e_machine field in the ELF header that can potentially be useful here.

thanks,

David

> Alternatively, could marking the native file header in a way that
> identifies it as a bitcode wrapper be an option? That way we wouldn't
> need to parse the section headers upfront to identify how the input
> should be treated. This is possibly something we can look at as a
> private change.
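To make the e_machine suggestion concrete: the field sits at a fixed offset in the ELF file header (after the 16-byte e_ident and the 2-byte e_type), so it can be read from the first 20 bytes of a file without touching the section headers. The sketch below is Python rather than linker code. HYPOTHETICAL_EM_BITCODE_WRAPPER is a made-up placeholder, not an assigned ELF machine number, and nothing in this thread says the wrapper format sets e_machine this way; classify_input and fake_elf_header are likewise illustrative names.

```python
import struct

ELF_MAGIC = b"\x7fELF"
EI_DATA = 5            # index of the data-encoding byte within e_ident
ELFDATA2LSB = 1        # little-endian
ELFDATA2MSB = 2        # big-endian
E_MACHINE_OFFSET = 18  # e_ident (16 bytes) + e_type (2 bytes)
EM_X86_64 = 62

# Placeholder only: NOT a registered e_machine value.
HYPOTHETICAL_EM_BITCODE_WRAPPER = 0xFF01


def read_e_machine(header: bytes):
    """Return e_machine from the start of an ELF file, or None if not ELF."""
    if len(header) < E_MACHINE_OFFSET + 2 or header[:4] != ELF_MAGIC:
        return None
    if header[EI_DATA] == ELFDATA2LSB:
        fmt = "<H"
    elif header[EI_DATA] == ELFDATA2MSB:
        fmt = ">H"
    else:
        return None  # unknown data encoding
    return struct.unpack_from(fmt, header, E_MACHINE_OFFSET)[0]


def classify_input(header: bytes) -> str:
    """Classify an input using only the file header."""
    machine = read_e_machine(header)
    if machine is None:
        return "not-elf"
    if machine == HYPOTHETICAL_EM_BITCODE_WRAPPER:
        return "wrapped-bitcode"
    return "native-object"


def fake_elf_header(machine: int, data: int = ELFDATA2LSB) -> bytes:
    """Build a 20-byte ELF64 relocatable header prefix for demonstration."""
    e_ident = ELF_MAGIC + bytes([2, data, 1, 0]) + bytes(8)  # class, data, version, OS ABI, padding
    fmt = "<HH" if data == ELFDATA2LSB else ">HH"
    return e_ident + struct.pack(fmt, 1, machine)  # e_type = ET_REL (1)


if __name__ == "__main__":
    print(classify_input(fake_elf_header(EM_X86_64)))
    print(classify_input(fake_elf_header(HYPOTHETICAL_EM_BITCODE_WRAPPER)))
```

A check like this costs one small read per input, which is the property being asked for above: the decision is made from the file header alone.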
>
> > Specifically, in the C API these are the lto_module_is_object_file*
> > variants, which will return true for either straight-up or
> > native-wrapped bitcode. All of the mechanics of handling bitcode vs
> > native object-wrapped bitcode are down in the IRObjectFile handling.
> > So the LTOModule::isBitcode*/lto_module_is_object_file* will correctly
> > identify native object-wrapped bitcode as bitcode. And the
> > LTOModule::createFrom*/lto_module_create* routines correctly parse the
> > native object-wrapped bitcode and return an LTOModule.
>
> It's great to see these API functions have been considered and added
> to libLTO and I'm sure they will help for other tools. Unfortunately
> not ours.