thr3ads.net - llvm dev - [LLVMdev] RFC: ThinLTO Impementation Plan [May 2015]

Xinliang David Li

2015-May-14 06:23 UTC

[LLVMdev] RFC: ThinLTO Impementation Plan

On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <alexr at leftfield.org>
wrote:
> "ELF-wrapped bitcode" seems potentially controversial to me.
>
> What about ar, nm, and various ld implementations adds this requirement?
> What about the LLVM implementations of these tools is lacking?
>
Sorry, I cannot parse your questions properly. Can you make them clearer?

David

>
> Alex
>
> > On May 13, 2015, at 7:44 PM, Teresa Johnson <tejohnson at google.com> wrote:
> >
> > I've included below an RFC for implementing ThinLTO in LLVM, looking
> > forward to feedback and questions.
> > Thanks!
> > Teresa
> >
> >
> >
> > RFC to discuss plans for implementing ThinLTO upstream. Background can
> > be found in slides from EuroLLVM 2015:
> >
> > https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0
> > As described in the talk, we have a prototype implementation, and
> > would like to start staging patches upstream. This RFC describes a
> > breakdown of the major pieces. We would like to commit upstream
> > gradually in several stages, with all functionality off by default.
> > The core ThinLTO importing support and tuning will require frequent
> > change and iteration during testing and tuning, and for that part we
> > would like to commit rapidly (off by default). See the proposed staged
> > implementation described in the Implementation Plan section.
> >
> >
> > ThinLTO Overview
> > ================
> >
> > See the talk slides linked above for more details. The following is a
> > high-level overview of the motivation.
> >
> > Cross Module Optimization (CMO) is an effective means for improving
> > runtime performance, by extending the scope of optimizations across
> > source module boundaries. Without CMO, the compiler is limited to
> > optimizing within the scope of single source modules. Two solutions
> > for enabling CMO are Link-Time Optimization (LTO), which is currently
> > supported in LLVM and GCC, and Lightweight-Interprocedural
> > Optimization (LIPO). However, each of these solutions has limitations
> > that prevent it from being enabled by default. ThinLTO is a new
> > approach that attempts to address these limitations, with a goal of
> > being enabled more broadly. ThinLTO is designed with many of the same
> > principles as LIPO, and therefore shares its advantages, without any of
> > its inherent weaknesses. Unlike in LIPO, where the module group decision is
> > made at profile training runtime, ThinLTO makes the decision at
> > compile time, but in a lazy mode that facilitates large scale
> > parallelism. The serial linker plugin phase is designed to be razor
> > thin and blazingly fast. By default this step only does minimal
> > preparation work to enable the parallel lazy importing performed
> > later. ThinLTO aims to be scalable like a regular O2 build, enabling
> > CMO on machines without large memory configurations, while also
> > integrating well with distributed build systems. Results from early
> > prototyping on SPEC cpu2006 C++ benchmarks are in line with
> > expectations that ThinLTO can scale like O2 while enabling much of the
> > CMO performed during a full LTO build.
> >
> >
> > A ThinLTO build is divided into 3 phases, which are referred to in the
> > following implementation plan:
> >
> > phase-1: IR and Function Summary Generation (-c compile)
> > phase-2: Thin Linker Plugin Layer (thin archive linker step)
> > phase-3: Parallel Backend with Demand-Driven Importing
> >
> >
> > Implementation Plan
> > ===================
> >
> > This section gives a high-level breakdown of the ThinLTO support that
> > will be added, in roughly the order that the patches would be staged.
> > The patches are divided into three stages. The first stage contains a
> > minimal amount of preparation work that is not ThinLTO-specific. The
> > second stage contains most of the infrastructure for ThinLTO, which
> > will be off by default. The third stage includes
> > enhancements/improvements/tunings that can be performed after the main
> > ThinLTO infrastructure is in.
> >
> > The second and third implementation stages will initially be very
> > volatile, requiring a lot of iterations and tuning with large apps to
> > get stabilized. Therefore it will be important to do fast commits for
> > these implementation stages.
> >
> >
> > 1. Stage 1: Preparation
> > -------------------------------
> >
> > The first planned sets of patches are enablers for ThinLTO work:
> >
> >
> > a. LTO directory structure:
> >
> > Restructure the LTO directory to remove the circular dependence that
> > arises when the ThinLTO pass is added. Because ThinLTO is being
> > implemented as an SCC pass
> > within Transforms/IPO, and leverages the LTOModule class for linking
> > in functions from modules, IPO then requires the LTO library. This
> > creates a circular dependence between LTO and IPO. To break that, we
> > need to split the lib/LTO directory/library into lib/LTO/CodeGen and
> > lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
> > respectively. Only LTOCodeGenerator has a dependence on IPO, removing
> > the circular dependence.
> >
> >
> > b. ELF wrapper generation support:
> >
> > Implement an ELF-wrapped bitcode writer. In order to more easily interact
> > with tools such as $AR, $NM, and “$LD -r” we plan to emit the phase-1
> > bitcode wrapped in ELF via the .llvmbc section, along with a symbol
> > table. The goal is both to interact with these tools without requiring
> > a plugin, and also to avoid doing partial LTO/ThinLTO across files
> > linked with “$LD -r” (i.e. the resulting object file should still
> > contain ELF-wrapped bitcode to enable ThinLTO at the full link step).
> > I will send a separate design document for these changes, but the
> > following is a high-level overview.
> >
> > Support was added to LLVM for reading ELF-wrapped bitcode
> > (http://reviews.llvm.org/rL218078), but there does not yet exist
> > support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
> > add support for optionally generating bitcode in an ELF file
> > containing a single .llvmbc section holding the bitcode. Specifically,
> > the patch would add new options “emit-llvm-bc-elf” (object file) and
> > corresponding “emit-llvm-elf” (textual assembly code equivalent).
> > Eventually these would be automatically triggered under “-fthinlto -c”
> > and “-fthinlto -S”, respectively.
> >
> > Additionally, a symbol table will be generated in the ELF file,
> > holding the function symbols within the bitcode. This facilitates
> > handling archives of the ELF-wrapped bitcode created with $AR, since
> > the archive will have a symbol table as well. The archive symbol table
> > enables gold to extract and pass to the plugin the constituent
> > ELF-wrapped bitcode files. To support the concatenated llvmbc section
> > generated by “$LD -r”, some handling needs to be added to gold and to
> > the backend driver to process each original module’s bitcode.
> >
> > The function index/summary will later be added as a special ELF
> > section alongside the .llvmbc sections.
> >
> >
> > 2. Stage 2: ThinLTO Infrastructure
> > ----------------------------------------------
> >
> > The next set of patches adds the base implementation of the ThinLTO
> > infrastructure, specifically those required to make ThinLTO functional
> > and generate correct but not necessarily high-performing binaries. It
> > also does not include support to make debug support under -g efficient
> > with ThinLTO.
> >
> >
> > a. Clang/LLVM/gold linker options:
> >
> > An early set of clang/llvm patches is needed to provide options to
> > enable ThinLTO (off by default), so that the rest of the
> > implementation can be disabled by default as it is added.
> > Specifically, clang options -fthinlto (used instead of -flto) will
> > cause clang to invoke the phase-1 emission of LLVM bitcode and
> > function summary/index on a compile step, and pass the appropriate
> > option to the gold plugin on a link step. The -thinlto option will be
> > added to the gold plugin and llvm-lto tool to launch the phase-2 thin
> > archive step. The -thinlto option will also be added to the ‘opt’ tool
> > to invoke it as a phase-3 parallel backend instance.
> >
> >
> > b. Thin-archive linking support in Gold plugin and llvm-lto:
> >
> > Under the new plugin option (see above), the plugin needs to perform
> > the phase-2 (thin archive) link which simply emits a combined function
> > map from the linked modules, without actually performing the normal
> > link. Corresponding support should be added to the standalone llvm-lto
> > tool to enable testing/debugging without involving the linker and
> > plugin.
> >
> >
> > c. ThinLTO backend support:
> >
> > Support for running a phase-3 backend (including importing) on a
> > module should be added to the ‘opt’ tool under the new option. The
> > main changes under the option are to instantiate a Linker object that
> > manages linking imported functions into the module, to read the
> > combined function map efficiently, and to enable the ThinLTO import
> > pass.
> >
> >
> > d. Function index/summary support:
> >
> > This includes infrastructure for writing and reading the function
> > index/summary section. As noted earlier this will be encoded in a
> > special ELF section within the module, alongside the .llvmbc section
> > containing the bitcode. The thin archive generated by phase-2 of
> > ThinLTO simply contains all of the function index/summary sections
> > across the linked modules, organized for efficient function lookup.
> >
> > Each function available for importing from the module contains an
> > entry in the module’s function index/summary section and in the
> > resulting combined function map. Each function entry contains that
> > function’s offset within the bitcode file, used to efficiently locate
> > and quickly import just that function. The entry also contains summary
> > information (e.g. basic information determined during parsing such as
> > the number of instructions in the function), that will be used to help
> > guide later import decisions. Because the contents of this section
> > will change frequently during ThinLTO tuning, it should also be marked
> > with a version id for backwards compatibility or version checking.
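
As a rough illustration of the index described above, the following Python sketch models a per-module function index whose entries carry a bitcode offset and an instruction count, plus a version id that is checked before the indexes are merged into a combined function map. The names `FunctionEntry`, `FunctionIndex`, `INDEX_VERSION`, and `combine` are hypothetical; they are not taken from the RFC or from LLVM, and the real section encoding is not specified here.

```python
from dataclasses import dataclass, field

INDEX_VERSION = 1  # hypothetical version id, bumped whenever the layout changes


@dataclass(frozen=True)
class FunctionEntry:
    module: str      # path of the module that defines the function
    offset: int      # offset of the function body within that module's bitcode
    inst_count: int  # summary data: number of instructions in the function


@dataclass
class FunctionIndex:
    version: int = INDEX_VERSION
    entries: dict = field(default_factory=dict)  # function name -> FunctionEntry

    def add(self, name, entry):
        self.entries[name] = entry


def combine(indexes):
    """Merge per-module indexes into one combined function map."""
    combined = FunctionIndex()
    for index in indexes:
        # Reject sections written with a layout this reader does not know.
        if index.version != INDEX_VERSION:
            raise ValueError(f"unsupported index version {index.version}")
        for name, entry in index.entries.items():
            # Keep the first definition seen; a real implementation would
            # need an explicit policy for duplicate names.
            combined.entries.setdefault(name, entry)
    return combined
```

Looking up a name in the combined map yields the defining module and the offset to seek to, which is the property the RFC relies on for importing a single function without reading the whole module.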
> >
> >
> > e. ThinLTO importing support:
> >
> > Support for the mechanics of importing functions from other modules,
> > which can go in gradually as a set of patches since it will be off by
> > default. Separate patches can include:
> >
> > - BitcodeReader changes to use the function index to import/deserialize
> > a single function of interest (small changes, leverages existing lazy
> > streamer support).
> >
> > - Minor LTOModule changes to pass the ThinLTO function to import and
> > its index into the bitcode reader.
> >
> > - Marking of imported functions (for use in ThinLTO-specific symbol
> > linking and global DCE, for example). This can be in-memory initially,
> > but IR support may be required in order to support streaming bitcode
> > out and back in again after importing.
> >
> > - ModuleLinker changes to do ThinLTO-specific symbol linking and
> > static promotion when necessary. The linkage type of imported
> > functions changes to AvailableExternallyLinkage, for example. Statics
> > must be promoted in certain cases, and renamed in consistent ways.
> >
> > - GlobalDCE changes to support removing imported functions that were
> > not inlined (very small changes to existing pass logic).
> >
> >
> > f. ThinLTO Import Driver SCC pass:
> >
> > Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
> > an SCC pass, enabled only under -fthinlto options. The pass includes
> > utilizing the thin archive (global function index/summary), import
> > decision heuristics, invocation of LTOModule/ModuleLinker routines
> > that perform the import, and any necessary callgraph updates and
> > verification.
> >
> >
> > g. Backend Driver:
> >
> > For a single node build, the gold plugin can simply write a makefile
> > and fork the parallel backend instances directly via parallel make.
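
A minimal sketch of the single-node driver idea, assuming one independent makefile rule per module so that parallel make can run the backends concurrently. The function name `write_backend_makefile`, the `.o` naming scheme, and the `backend_cmd` placeholder are all assumptions for illustration; the RFC does not give the actual backend command line.

```python
def write_backend_makefile(modules, backend_cmd):
    """Return makefile text with one independent rule per module.

    backend_cmd is a placeholder for whatever command runs a phase-3
    backend on one module; it is not a real tool name.
    """
    objects = [module + ".o" for module in modules]
    lines = ["all: " + " ".join(objects), ""]
    for module, obj in zip(modules, objects):
        lines.append(f"{obj}: {module}")
        # Recipe lines in a makefile must start with a tab character.
        lines.append(f"\t{backend_cmd} {module} -o {obj}")
        lines.append("")
    return "\n".join(lines)
```

Because no rule depends on another rule's output, running the result with `make -j N -f <file>` lets make schedule up to N backend instances at once.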
> >
> >
> > 3. Stage 3: ThinLTO Tuning and Enhancements
> > ----------------------------------------------------------------
> >
> > This refers to the patches that are not required for ThinLTO to work,
> > but rather to improve compile time, memory, run-time performance and
> > usability.
> >
> >
> > a. Lazy Debug Metadata Linking:
> >
> > The prototype implementation included lazy importing of module-level
> > metadata during the ThinLTO pass finalization (i.e. after all function
> > importing is complete). This actually applies to all module-level
> > metadata, not just debug, although it is the largest. This can be
> > added as a separate set of patches. It requires changes to
> > BitcodeReader, ValueMapper, and ModuleLinker.
> >
> >
> > b. Import Tuning:
> >
> > Tuning the import strategy will be an iterative process that will
> > continue to be refined over time. It involves several different types
> > of changes: adding support for recording additional metrics in the
> > function summary, such as profile data and optional heavier-weight IPA
> > analyses, and tuning the import heuristics based on the summary and
> > callsite context.
> >
> >
> > c. Combined Function Map Pruning:
> >
> > The combined function map can be pruned of functions that are unlikely
> > to benefit from being imported. For example, during the phase-2 thin
> > archive plugin step we can safely omit large and (with profile data)
> > cold functions, which are unlikely to benefit from being inlined.
> > Additionally, all but one copy of comdat functions can be suppressed.
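
The pruning rules above could be sketched as follows. This reads the text as "drop functions that are large, and also drop functions that profile data marks as cold", which is one possible interpretation. The tuple layout, the `max_insts` threshold, and the `is_cold` callback are hypothetical.

```python
def prune(entries, max_insts, is_cold=None):
    """Filter a combined function map.

    entries: list of (name, module, inst_count, is_comdat) tuples.
    max_insts: size threshold above which a function is not worth importing.
    is_cold: optional predicate on the function name, available only when
             profile data exists.
    """
    kept = []
    seen_comdat = set()
    for name, module, inst_count, is_comdat in entries:
        if inst_count > max_insts:
            continue  # too large to be a likely inlining candidate
        if is_cold is not None and is_cold(name):
            continue  # cold according to profile data
        if is_comdat:
            if name in seen_comdat:
                continue  # keep only the first copy of a comdat function
            seen_comdat.add(name)
        kept.append((name, module, inst_count, is_comdat))
    return kept
```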
> >
> >
> > d. Distributed Build System Integration:
> >
> > For a distributed build system, the gold plugin should write the
> > parallel backend invocations into a makefile, including the mapping
> > from the IR file to the real object file path, and exit. Additional
> > work needs to be done in the distributed build system itself to
> > distribute and dispatch the parallel backend jobs to the build
> > cluster.
> >
> >
> > e. Dependence Tracking and Incremental Compiles:
> >
> > In order to support build systems that stage from local disks or
> > network storage, the plugin will optionally support computation of
> > dependent sets of IR files that each module may import from. This can
> > be computed from profile data, if it exists, or from the symbol table
> > and heuristics if not. These dependence sets also enable support for
> > incremental backend compiles.
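
To show how such dependence sets might drive incremental backend compiles, here is a small sketch. It assumes that imports are taken from the phase-1 IR of other modules, so a module needs its backend rerun only if it changed itself or if something in its dependence set changed. The function name and the dictionary layout are hypothetical.

```python
def modules_to_rebuild(import_deps, changed):
    """Return the modules whose backend compile must be rerun.

    import_deps: dict mapping each module to the set of modules it may
                 import from.
    changed: iterable of modules whose IR changed since the last build.
    """
    changed = set(changed)
    return sorted(
        module
        for module, deps in import_deps.items()
        if module in changed or deps & changed
    )
```

The same sets tell a staging build system which IR files must be copied alongside each module before its backend job starts.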
> >
> >
> >
> > --
> > Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Teresa Johnson

2015-May-14 13:50 UTC

On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
<xinliangli at gmail.com> wrote:
>
>
> On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <alexr at leftfield.org>
> wrote:
>>
>> "ELF-wrapped bitcode" seems potentially controversial to me.
>>
>> What about ar, nm, and various ld implementations adds this requirement?
>> What about the LLVM implementations of these tools is lacking?
>
>
> Sorry I can not parse your questions properly. Can you make it clearer?

Alex is asking what the issue is with ar, nm, ld -r and regular
bitcode that makes using elf-wrapped bitcode easier.

The issue is that generally you need to provide a plugin to these
tools in order for them to understand and handle bitcode files. We'd
like standard tools to work without requiring a plugin as much as
possible. And in some cases we want the files to be handled differently
from the way bitcode files are handled with the plugin.

nm: Without a plugin, normal bitcode files are inscrutable. When
provided the gold plugin it can emit the symbols.
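
To illustrate why a tool without a plugin cannot make sense of a raw bitcode file, here is a minimal Python sketch that classifies a file by its leading magic bytes. The function name `classify` and the returned labels are made up for this example; the magic values themselves are the standard ones for ELF, ar archives, and raw LLVM bitcode.

```python
def classify(header: bytes) -> str:
    """Classify a file from its first few bytes."""
    if header.startswith(b"\x7fELF"):
        return "elf"        # object format that standard binutils understand
    if header.startswith(b"!<arch>\n"):
        return "archive"    # ar archive
    if header.startswith(b"BC\xc0\xde"):
        return "bitcode"    # raw LLVM bitcode, opaque without a plugin
    return "unknown"
```

An ELF-wrapped bitcode file would fall into the first case, so a standard tool would at least be able to read its section headers and symbol table.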

ar: Without a plugin, it will create an archive of bitcode files, but
without an index, so it can't be handled by the linker even with a
plugin on an -flto link. When ar is provided the gold plugin it does
create an index, so the linker + gold plugin handle it appropriately
on an -flto link.
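
The role of the archive index can be sketched as a map from symbol name to the archive member that defines it, which the linker consults to decide which members to extract. This is a simplified model with hypothetical names (`build_archive_index`, `members_needed`); it does a single lookup pass and does not follow the undefined symbols of the members it pulls in.

```python
def build_archive_index(members):
    """members: dict mapping member name -> iterable of defined symbols."""
    index = {}
    for member, symbols in members.items():
        for symbol in symbols:
            # The first member defining a symbol wins.
            index.setdefault(symbol, member)
    return index


def members_needed(index, undefined):
    """Return the members that define any of the undefined symbols."""
    return sorted({index[s] for s in undefined if s in index})
```

If ar cannot read the symbols of its members, as with raw bitcode and no plugin, the index stays empty and no member is ever selected.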

ld -r: Without a plugin, it fails when provided bitcode inputs. When
provided the gold plugin, it handles them but compiles them all the
way through to ELF executable instructions via a partial LTO link.
This is where we would like to differ in behavior (while also not
requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
output file to still contain ELF-wrapped bitcode, delaying the LTO
until the full link step.

Let me know if that helps address your concerns.

Thanks,
Teresa


-- 
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413

Eric Christopher

2015-May-14 14:22 UTC

[LLVMdev] RFC: ThinLTO Impementation Plan

So, what Alex is saying is that we have these tools as well and they
understand bitcode just fine, as well as every object format - not just
ELF. :)

-eric

On Thu, May 14, 2015, 6:55 AM Teresa Johnson <tejohnson at google.com>
wrote:
> On Wed, May 13, 2015 at 11:23 PM, Xinliang David Li
> <xinliangli at gmail.com> wrote:
> >
> >
> > On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <alexr at leftfield.org>
> > wrote:
> >>
> >> "ELF-wrapped bitcode" seems potentially controversial to me.
> >>
> >> What about ar, nm, and various ld implementations adds this requirement?
> >> What about the LLVM implementations of these tools is lacking?
> >
> >
> > Sorry I can not parse your questions properly. Can you make it clearer?
>
> Alex is asking what the issue is with ar, nm, ld -r and regular
> bitcode that makes using ELF-wrapped bitcode easier.
>
> The issue is that generally you need to provide a plugin to these
> tools in order for them to understand and handle bitcode files. We'd
> like standard tools to work without requiring a plugin as much as
> possible. And in some cases we want ELF-wrapped bitcode files to be
> handled differently from the way bitcode files are handled with the
> plugin.
>
> nm: Without a plugin, normal bitcode files are inscrutable. When
> provided the gold plugin it can emit the symbols.
>
> ar: Without a plugin, it will create an archive of bitcode files, but
> without an index, so it can't be handled by the linker even with a
> plugin on an -flto link. When ar is provided the gold plugin it does
> create an index, so the linker + gold plugin handle it appropriately
> on an -flto link.
>
> ld -r: Without a plugin, fails when provided bitcode inputs. When
> provided the gold plugin, it handles them but compiles them all the
> way through to ELF executable instructions via a partial LTO link.
> This is where we would like to differ in behavior (while also not
> requiring a plugin) with ELF-wrapped bitcode: we would like the ld -r
> output file to still contain ELF-wrapped bitcode, delaying the LTO
> until the full link step.
>
> Let me know if that helps address your concerns.
>
> Thanks,
> Teresa
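
To make the plugin issue above concrete, here is a small sketch of how a tool might tell these inputs apart by their leading bytes. The magic values are the standard ones for ELF, ar archives, raw LLVM bitcode, and the LLVM bitcode wrapper header; the `classify` helper itself is only an illustration and is not how any of the tools discussed is implemented.

```python
ELF_MAGIC = b"\x7fELF"
AR_MAGIC = b"!<arch>\n"
BITCODE_MAGIC = b"BC\xc0\xde"                # raw LLVM bitcode
BITCODE_WRAPPER_MAGIC = b"\xde\xc0\x17\x0b"  # 0x0B17C0DE stored little-endian


def classify(header: bytes) -> str:
    """Classify a file by the first bytes of its contents."""
    if header.startswith(AR_MAGIC):
        return "archive"
    if header.startswith(ELF_MAGIC):
        return "elf"
    if header.startswith(BITCODE_MAGIC) or header.startswith(BITCODE_WRAPPER_MAGIC):
        return "bitcode"
    return "unknown"


def classify_file(path: str) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as f:
        return classify(f.read(8))
```

A tool that only knows the ELF container would treat the "bitcode" case as an unrecognized file, whereas an ELF-wrapped bitcode file falls into the "elf" case and exposes an ordinary symbol table.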