thr3ads.net - llvm dev - [LLVMdev] RFC: ThinLTO Impementation Plan [May 2015]

If this information is useful, please help other people find it:
Share via:

Teresa Johnson

2015-May-13 18:44 UTC

[LLVMdev] RFC: ThinLTO Impementation Plan

I've included below an RFC for implementing ThinLTO in LLVM, looking
forward to feedback and questions.
Thanks!
Teresa



RFC to discuss plans for implementing ThinLTO upstream. Background can
be found in slides from EuroLLVM 2015:
   https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
As described in the talk, we have a prototype implementation, and
would like to start staging patches upstream. This RFC describes a
breakdown of the major pieces. We would like to commit upstream
gradually in several stages, with all functionality off by default.
The core ThinLTO importing support and tuning will require frequent
change and iteration during testing and tuning, and for that part we
would like to commit rapidly (off by default). See the proposed staged
implementation described in the Implementation Plan section.


ThinLTO Overview
=============
See the talk slides linked above for more details. The following is a
high-level overview of the motivation.

Cross Module Optimization (CMO) is an effective means for improving
runtime performance, by extending the scope of optimizations across
source module boundaries. Without CMO, the compiler is limited to
optimizing within the scope of single source modules. Two solutions
for enabling CMO are Link-Time Optimization (LTO), which is currently
supported in LLVM and GCC, and Lightweight-Interprocedural
Optimization (LIPO). However, each of these solutions has limitations
that prevent it from being enabled by default. ThinLTO is a new
approach that attempts to address these limitations, with a goal of
being enabled more broadly. ThinLTO is designed with many of the same
principals as LIPO, and therefore its advantages, without any of its
inherent weakness. Unlike in LIPO where the module group decision is
made at profile training runtime, ThinLTO makes the decision at
compile time, but in a lazy mode that facilitates large scale
parallelism. The serial linker plugin phase is designed to be razor
thin and blazingly fast. By default this step only does minimal
preparation work to enable the parallel lazy importing performed
later. ThinLTO aims to be scalable like a regular O2 build, enabling
CMO on machines without large memory configurations, while also
integrating well with distributed build systems. Results from early
prototyping on SPEC cpu2006 C++ benchmarks are in line with
expectations that ThinLTO can scale like O2 while enabling much of the
CMO performed during a full LTO build.


A ThinLTO build is divided into 3 phases, which are referred to in the
following implementation plan:

phase-1: IR and Function Summary Generation (-c compile)
phase-2: Thin Linker Plugin Layer (thin archive linker step)
phase-3: Parallel Backend with Demand-Driven Importing


Implementation Plan
===============
This section gives a high-level breakdown of the ThinLTO support that
will be added, in roughly the order that the patches would be staged.
The patches are divided into three stages. The first stage contains a
minimal amount of preparation work that is not ThinLTO-specific. The
second stage contains most of the infrastructure for ThinLTO, which
will be off by default. The third stage includes
enhancements/improvements/tunings that can be performed after the main
ThinLTO infrastructure is in.

The second and third implementation stages will initially be very
volatile, requiring a lot of iterations and tuning with large apps to
get stabilized. Therefore it will be important to do fast commits for
these implementation stages.


1. Stage 1: Preparation
-------------------------------

The first planned sets of patches are enablers for ThinLTO work:


a. LTO directory structure:

Restructure the LTO directory to remove circular dependence when
ThinLTO pass added. Because ThinLTO is being implemented as a SCC pass
within Transforms/IPO, and leverages the LTOModule class for linking
in functions from modules, IPO then requires the LTO library. This
creates a circular dependence between LTO and IPO. To break that, we
need to split the lib/LTO directory/library into lib/LTO/CodeGen and
lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
respectively. Only LTOCodeGenerator has a dependence on IPO, removing
the circular dependence.


b. ELF wrapper generation support:

Implement ELF wrapped bitcode writer. In order to more easily interact
with tools such as $AR, $NM, and “$LD -r” we plan to emit the phase-1
bitcode wrapped in ELF via the .llvmbc section, along with a symbol
table. The goal is both to interact with these tools without requiring
a plugin, and also to avoid doing partial LTO/ThinLTO across files
linked with “$LD -r” (i.e. the resulting object file should still
contain ELF-wrapped bitcode to enable ThinLTO at the full link step).
I will send a separate design document for these changes, but the
following is a high-level overview.

Support was added to LLVM for reading ELF-wrapped bitcode
(http://reviews.llvm.org/rL218078), but there does not yet exist
support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
add support for optionally generating bitcode in an ELF file
containing a single .llvmbc section holding the bitcode. Specifically,
the patch would add new options “emit-llvm-bc-elf” (object file) and
corresponding “emit-llvm-elf” (textual assembly code equivalent).
Eventually these would be automatically triggered under “-fthinlto -c”
and “-fthinlto -S”, respectively.

Additionally, a symbol table will be generated in the ELF file,
holding the function symbols within the bitcode. This facilitates
handling archives of the ELF-wrapped bitcode created with $AR, since
the archive will have a symbol table as well. The archive symbol table
enables gold to extract and pass to the plugin the constituent
ELF-wrapped bitcode files. To support the concatenated llvmbc section
generated by “$LD -r”, some handling needs to be added to gold and to
the backend driver to process each original module’s bitcode.

The function index/summary will later be added as a special ELF
section alongside the .llvmbc sections.


2. Stage 2: ThinLTO Infrastructure
----------------------------------------------

The next set of patches adds the base implementation of the ThinLTO
infrastructure, specifically those required to make ThinLTO functional
and generate correct but not necessarily high-performing binaries. It
also does not include support to make debug support under -g efficient
with ThinLTO.


a. Clang/LLVM/gold linker options:

An early set of clang/llvm patches is needed to provide options to
enable ThinLTO (off by default), so that the rest of the
implementation can be disabled by default as it is added.
Specifically, clang options -fthinlto (used instead of -flto) will
cause clang to invoke the phase-1 emission of LLVM bitcode and
function summary/index on a compile step, and pass the appropriate
option to the gold plugin on a link step. The -thinlto option will be
added to the gold plugin and llvm-lto tool to launch the phase-2 thin
archive step. The -thinlto option will also be added to the ‘opt’ tool
to invoke it as a phase-3 parallel backend instance.


b. Thin-archive linking support in Gold plugin and llvm-lto:

Under the new plugin option (see above), the plugin needs to perform
the phase-2 (thin archive) link which simply emits a combined function
map from the linked modules, without actually performing the normal
link. Corresponding support should be added to the standalone llvm-lto
tool to enable testing/debugging without involving the linker and
plugin.


c. ThinLTO backend support:

Support for invoking a phase-3 backend invocation (including
importing) on a module should be added to the ‘opt’ tool under the new
option. The main change under the option is to instantiate a Linker
object used to manage the process of linking imported functions into
the module, efficient read of the combined function map, and enable
the ThinLTO import pass.


d. Function index/summary support:

This includes infrastructure for writing and reading the function
index/summary section. As noted earlier this will be encoded in a
special ELF section within the module, alongside the .llvmbc section
containing the bitcode. The thin archive generated by phase-2 of
ThinLTO simply contains all of the function index/summary sections
across the linked modules, organized for efficient function lookup.

Each function available for importing from the module contains an
entry in the module’s function index/summary section and in the
resulting combined function map. Each function entry contains that
function’s offset within the bitcode file, used to efficiently locate
and quickly import just that function. The entry also contains summary
information (e.g. basic information determined during parsing such as
the number of instructions in the function), that will be used to help
guide later import decisions. Because the contents of this section
will change frequently during ThinLTO tuning, it should also be marked
with a version id for backwards compatibility or version checking.


e. ThinLTO importing support:

Support for the mechanics of importing functions from other modules,
which can go in gradually as a set of patches since it will be off by
default. Separate patches can include:

- BitcodeReader changes to use function index to import/deserialize
single function of interest (small changes, leverages existing lazy
streamer support).

- Minor LTOModule changes to pass the ThinLTO function to import and
its index into bitcode reader.

- Marking of imported functions (for use in ThinLTO-specific symbol
linking and global DCE, for example). This can be in-memory initially,
but IR support may be required in order to support streaming bitcode
out and back in again after importing.

- ModuleLinker changes to do ThinLTO-specific symbol linking and
static promotion when necessary. The linkage type of imported
functions changes to AvailableExternallyLinkage, for example. Statics
must be promoted in certain cases, and renamed in consistent ways.

- GlobalDCE changes to support removing imported functions that were
not inlined (very small changes to existing pass logic).


f. ThinLTO Import Driver SCC pass:

Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
an SCC pass, enabled only under -fthinlto options. The pass includes
utilizing the thin archive (global function index/summary), import
decision heuristics, invocation of LTOModule/ModuleLinker routines
that perform the import, and any necessary callgraph updates and
verification.


g. Backend Driver:

For a single node build, the gold plugin can simply write a makefile
and fork the parallel backend instances directly via parallel make.


3. Stage 3: ThinLTO Tuning and Enhancements
----------------------------------------------------------------

This refers to the patches that are not required for ThinLTO to work,
but rather to improve compile time, memory, run-time performance and
usability.


a. Lazy Debug Metadata Linking:

The prototype implementation included lazy importing of module-level
metadata during the ThinLTO pass finalization (i.e. after all function
importing is complete). This actually applies to all module-level
metadata, not just debug, although it is the largest. This can be
added as a separate set of patches. Changes to BitcodeReader,
ValueMapper, ModuleLinker


b. Import Tuning:

Tuning the import strategy will be an iterative process that will
continue to be refined over time. It involves several different types
of changes: adding support for recording additional metrics in the
function summary, such as profile data and optional heavier-weight IPA
analyses, and tuning the import heuristics based on the summary and
callsite context.


c. Combined Function Map Pruning:

The combined function map can be pruned of functions that are unlikely
to benefit from being imported. For example, during the phase-2 thin
archive plug step we can safely omit large and (with profile data)
cold functions, which are unlikely to benefit from being inlined.
Additionally, all but one copy of comdat functions can be suppressed.


d. Distributed Build System Integration:

For a distributed build system, the gold plugin should write the
parallel backend invocations into a makefile, including the mapping
from the IR file to the real object file path, and exit. Additional
work needs to be done in the distributed build system itself to
distribute and dispatch the parallel backend jobs to the build
cluster.


e. Dependence Tracking and Incremental Compiles:

In order to support build systems that stage from local disks or
network storage, the plugin will optionally support computation of
dependent sets of IR files that each module may import from. This can
be computed from profile data, if it exists, or from the symbol table
and heuristics if not. These dependence sets also enable support for
incremental backend compiles.



-- 
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413

Alex Rosenberg

2015-May-14 05:46 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

"ELF-wrapped bitcode" seems potentially controversial to me.

What about ar, nm, and various ld implementations adds this requirement? What
about the LLVM implementations of these tools is lacking?

Alex
> On May 13, 2015, at 7:44 PM, Teresa Johnson <tejohnson at google.com>
wrote:
> 
> I've included below an RFC for implementing ThinLTO in LLVM, looking
> forward to feedback and questions.
> Thanks!
> Teresa
> 
> 
> 
> RFC to discuss plans for implementing ThinLTO upstream. Background can
> be found in slides from EuroLLVM 2015:
>  
https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
> As described in the talk, we have a prototype implementation, and
> would like to start staging patches upstream. This RFC describes a
> breakdown of the major pieces. We would like to commit upstream
> gradually in several stages, with all functionality off by default.
> The core ThinLTO importing support and tuning will require frequent
> change and iteration during testing and tuning, and for that part we
> would like to commit rapidly (off by default). See the proposed staged
> implementation described in the Implementation Plan section.
> 
> 
> ThinLTO Overview
> =============> 
> See the talk slides linked above for more details. The following is a
> high-level overview of the motivation.
> 
> Cross Module Optimization (CMO) is an effective means for improving
> runtime performance, by extending the scope of optimizations across
> source module boundaries. Without CMO, the compiler is limited to
> optimizing within the scope of single source modules. Two solutions
> for enabling CMO are Link-Time Optimization (LTO), which is currently
> supported in LLVM and GCC, and Lightweight-Interprocedural
> Optimization (LIPO). However, each of these solutions has limitations
> that prevent it from being enabled by default. ThinLTO is a new
> approach that attempts to address these limitations, with a goal of
> being enabled more broadly. ThinLTO is designed with many of the same
> principals as LIPO, and therefore its advantages, without any of its
> inherent weakness. Unlike in LIPO where the module group decision is
> made at profile training runtime, ThinLTO makes the decision at
> compile time, but in a lazy mode that facilitates large scale
> parallelism. The serial linker plugin phase is designed to be razor
> thin and blazingly fast. By default this step only does minimal
> preparation work to enable the parallel lazy importing performed
> later. ThinLTO aims to be scalable like a regular O2 build, enabling
> CMO on machines without large memory configurations, while also
> integrating well with distributed build systems. Results from early
> prototyping on SPEC cpu2006 C++ benchmarks are in line with
> expectations that ThinLTO can scale like O2 while enabling much of the
> CMO performed during a full LTO build.
> 
> 
> A ThinLTO build is divided into 3 phases, which are referred to in the
> following implementation plan:
> 
> phase-1: IR and Function Summary Generation (-c compile)
> phase-2: Thin Linker Plugin Layer (thin archive linker step)
> phase-3: Parallel Backend with Demand-Driven Importing
> 
> 
> Implementation Plan
> ===============> 
> This section gives a high-level breakdown of the ThinLTO support that
> will be added, in roughly the order that the patches would be staged.
> The patches are divided into three stages. The first stage contains a
> minimal amount of preparation work that is not ThinLTO-specific. The
> second stage contains most of the infrastructure for ThinLTO, which
> will be off by default. The third stage includes
> enhancements/improvements/tunings that can be performed after the main
> ThinLTO infrastructure is in.
> 
> The second and third implementation stages will initially be very
> volatile, requiring a lot of iterations and tuning with large apps to
> get stabilized. Therefore it will be important to do fast commits for
> these implementation stages.
> 
> 
> 1. Stage 1: Preparation
> -------------------------------
> 
> The first planned sets of patches are enablers for ThinLTO work:
> 
> 
> a. LTO directory structure:
> 
> Restructure the LTO directory to remove circular dependence when
> ThinLTO pass added. Because ThinLTO is being implemented as a SCC pass
> within Transforms/IPO, and leverages the LTOModule class for linking
> in functions from modules, IPO then requires the LTO library. This
> creates a circular dependence between LTO and IPO. To break that, we
> need to split the lib/LTO directory/library into lib/LTO/CodeGen and
> lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
> respectively. Only LTOCodeGenerator has a dependence on IPO, removing
> the circular dependence.
> 
> 
> b. ELF wrapper generation support:
> 
> Implement ELF wrapped bitcode writer. In order to more easily interact
> with tools such as $AR, $NM, and “$LD -r” we plan to emit the phase-1
> bitcode wrapped in ELF via the .llvmbc section, along with a symbol
> table. The goal is both to interact with these tools without requiring
> a plugin, and also to avoid doing partial LTO/ThinLTO across files
> linked with “$LD -r” (i.e. the resulting object file should still
> contain ELF-wrapped bitcode to enable ThinLTO at the full link step).
> I will send a separate design document for these changes, but the
> following is a high-level overview.
> 
> Support was added to LLVM for reading ELF-wrapped bitcode
> (http://reviews.llvm.org/rL218078), but there does not yet exist
> support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
> add support for optionally generating bitcode in an ELF file
> containing a single .llvmbc section holding the bitcode. Specifically,
> the patch would add new options “emit-llvm-bc-elf” (object file) and
> corresponding “emit-llvm-elf” (textual assembly code equivalent).
> Eventually these would be automatically triggered under “-fthinlto -c”
> and “-fthinlto -S”, respectively.
> 
> Additionally, a symbol table will be generated in the ELF file,
> holding the function symbols within the bitcode. This facilitates
> handling archives of the ELF-wrapped bitcode created with $AR, since
> the archive will have a symbol table as well. The archive symbol table
> enables gold to extract and pass to the plugin the constituent
> ELF-wrapped bitcode files. To support the concatenated llvmbc section
> generated by “$LD -r”, some handling needs to be added to gold and to
> the backend driver to process each original module’s bitcode.
> 
> The function index/summary will later be added as a special ELF
> section alongside the .llvmbc sections.
> 
> 
> 2. Stage 2: ThinLTO Infrastructure
> ----------------------------------------------
> 
> The next set of patches adds the base implementation of the ThinLTO
> infrastructure, specifically those required to make ThinLTO functional
> and generate correct but not necessarily high-performing binaries. It
> also does not include support to make debug support under -g efficient
> with ThinLTO.
> 
> 
> a. Clang/LLVM/gold linker options:
> 
> An early set of clang/llvm patches is needed to provide options to
> enable ThinLTO (off by default), so that the rest of the
> implementation can be disabled by default as it is added.
> Specifically, clang options -fthinlto (used instead of -flto) will
> cause clang to invoke the phase-1 emission of LLVM bitcode and
> function summary/index on a compile step, and pass the appropriate
> option to the gold plugin on a link step. The -thinlto option will be
> added to the gold plugin and llvm-lto tool to launch the phase-2 thin
> archive step. The -thinlto option will also be added to the ‘opt’ tool
> to invoke it as a phase-3 parallel backend instance.
> 
> 
> b. Thin-archive linking support in Gold plugin and llvm-lto:
> 
> Under the new plugin option (see above), the plugin needs to perform
> the phase-2 (thin archive) link which simply emits a combined function
> map from the linked modules, without actually performing the normal
> link. Corresponding support should be added to the standalone llvm-lto
> tool to enable testing/debugging without involving the linker and
> plugin.
> 
> 
> c. ThinLTO backend support:
> 
> Support for invoking a phase-3 backend invocation (including
> importing) on a module should be added to the ‘opt’ tool under the new
> option. The main change under the option is to instantiate a Linker
> object used to manage the process of linking imported functions into
> the module, efficient read of the combined function map, and enable
> the ThinLTO import pass.
> 
> 
> d. Function index/summary support:
> 
> This includes infrastructure for writing and reading the function
> index/summary section. As noted earlier this will be encoded in a
> special ELF section within the module, alongside the .llvmbc section
> containing the bitcode. The thin archive generated by phase-2 of
> ThinLTO simply contains all of the function index/summary sections
> across the linked modules, organized for efficient function lookup.
> 
> Each function available for importing from the module contains an
> entry in the module’s function index/summary section and in the
> resulting combined function map. Each function entry contains that
> function’s offset within the bitcode file, used to efficiently locate
> and quickly import just that function. The entry also contains summary
> information (e.g. basic information determined during parsing such as
> the number of instructions in the function), that will be used to help
> guide later import decisions. Because the contents of this section
> will change frequently during ThinLTO tuning, it should also be marked
> with a version id for backwards compatibility or version checking.
> 
> 
> e. ThinLTO importing support:
> 
> Support for the mechanics of importing functions from other modules,
> which can go in gradually as a set of patches since it will be off by
> default. Separate patches can include:
> 
> - BitcodeReader changes to use function index to import/deserialize
> single function of interest (small changes, leverages existing lazy
> streamer support).
> 
> - Minor LTOModule changes to pass the ThinLTO function to import and
> its index into bitcode reader.
> 
> - Marking of imported functions (for use in ThinLTO-specific symbol
> linking and global DCE, for example). This can be in-memory initially,
> but IR support may be required in order to support streaming bitcode
> out and back in again after importing.
> 
> - ModuleLinker changes to do ThinLTO-specific symbol linking and
> static promotion when necessary. The linkage type of imported
> functions changes to AvailableExternallyLinkage, for example. Statics
> must be promoted in certain cases, and renamed in consistent ways.
> 
> - GlobalDCE changes to support removing imported functions that were
> not inlined (very small changes to existing pass logic).
> 
> 
> f. ThinLTO Import Driver SCC pass:
> 
> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
> an SCC pass, enabled only under -fthinlto options. The pass includes
> utilizing the thin archive (global function index/summary), import
> decision heuristics, invocation of LTOModule/ModuleLinker routines
> that perform the import, and any necessary callgraph updates and
> verification.
> 
> 
> g. Backend Driver:
> 
> For a single node build, the gold plugin can simply write a makefile
> and fork the parallel backend instances directly via parallel make.
> 
> 
> 3. Stage 3: ThinLTO Tuning and Enhancements
> ----------------------------------------------------------------
> 
> This refers to the patches that are not required for ThinLTO to work,
> but rather to improve compile time, memory, run-time performance and
> usability.
> 
> 
> a. Lazy Debug Metadata Linking:
> 
> The prototype implementation included lazy importing of module-level
> metadata during the ThinLTO pass finalization (i.e. after all function
> importing is complete). This actually applies to all module-level
> metadata, not just debug, although it is the largest. This can be
> added as a separate set of patches. Changes to BitcodeReader,
> ValueMapper, ModuleLinker
> 
> 
> b. Import Tuning:
> 
> Tuning the import strategy will be an iterative process that will
> continue to be refined over time. It involves several different types
> of changes: adding support for recording additional metrics in the
> function summary, such as profile data and optional heavier-weight IPA
> analyses, and tuning the import heuristics based on the summary and
> callsite context.
> 
> 
> c. Combined Function Map Pruning:
> 
> The combined function map can be pruned of functions that are unlikely
> to benefit from being imported. For example, during the phase-2 thin
> archive plug step we can safely omit large and (with profile data)
> cold functions, which are unlikely to benefit from being inlined.
> Additionally, all but one copy of comdat functions can be suppressed.
> 
> 
> d. Distributed Build System Integration:
> 
> For a distributed build system, the gold plugin should write the
> parallel backend invocations into a makefile, including the mapping
> from the IR file to the real object file path, and exit. Additional
> work needs to be done in the distributed build system itself to
> distribute and dispatch the parallel backend jobs to the build
> cluster.
> 
> 
> e. Dependence Tracking and Incremental Compiles:
> 
> In order to support build systems that stage from local disks or
> network storage, the plugin will optionally support computation of
> dependent sets of IR files that each module may import from. This can
> be computed from profile data, if it exists, or from the symbol table
> and heuristics if not. These dependence sets also enable support for
> incremental backend compiles.
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Xinliang David Li

2015-May-14 06:23 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

On Wed, May 13, 2015 at 10:46 PM, Alex Rosenberg <alexr at leftfield.org>
wrote:
> "ELF-wrapped bitcode" seems potentially controversial to me.
>
> What about ar, nm, and various ld implementations adds this requirement?
> What about the LLVM implementations of these tools is lacking?
>
Sorry I can not parse your questions properly. Can you make it clearer?

David

>
> Alex
>
> > On May 13, 2015, at 7:44 PM, Teresa Johnson <tejohnson at
google.com>
> wrote:
> >
> > I've included below an RFC for implementing ThinLTO in LLVM,
looking
> > forward to feedback and questions.
> > Thanks!
> > Teresa
> >
> >
> >
> > RFC to discuss plans for implementing ThinLTO upstream. Background can
> > be found in slides from EuroLLVM 2015:
> >
>
https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
> > As described in the talk, we have a prototype implementation, and
> > would like to start staging patches upstream. This RFC describes a
> > breakdown of the major pieces. We would like to commit upstream
> > gradually in several stages, with all functionality off by default.
> > The core ThinLTO importing support and tuning will require frequent
> > change and iteration during testing and tuning, and for that part we
> > would like to commit rapidly (off by default). See the proposed staged
> > implementation described in the Implementation Plan section.
> >
> >
> > ThinLTO Overview
> > =============> >
> > See the talk slides linked above for more details. The following is a
> > high-level overview of the motivation.
> >
> > Cross Module Optimization (CMO) is an effective means for improving
> > runtime performance, by extending the scope of optimizations across
> > source module boundaries. Without CMO, the compiler is limited to
> > optimizing within the scope of single source modules. Two solutions
> > for enabling CMO are Link-Time Optimization (LTO), which is currently
> > supported in LLVM and GCC, and Lightweight-Interprocedural
> > Optimization (LIPO). However, each of these solutions has limitations
> > that prevent it from being enabled by default. ThinLTO is a new
> > approach that attempts to address these limitations, with a goal of
> > being enabled more broadly. ThinLTO is designed with many of the same
> > principals as LIPO, and therefore its advantages, without any of its
> > inherent weakness. Unlike in LIPO where the module group decision is
> > made at profile training runtime, ThinLTO makes the decision at
> > compile time, but in a lazy mode that facilitates large scale
> > parallelism. The serial linker plugin phase is designed to be razor
> > thin and blazingly fast. By default this step only does minimal
> > preparation work to enable the parallel lazy importing performed
> > later. ThinLTO aims to be scalable like a regular O2 build, enabling
> > CMO on machines without large memory configurations, while also
> > integrating well with distributed build systems. Results from early
> > prototyping on SPEC cpu2006 C++ benchmarks are in line with
> > expectations that ThinLTO can scale like O2 while enabling much of the
> > CMO performed during a full LTO build.
> >
> >
> > A ThinLTO build is divided into 3 phases, which are referred to in the
> > following implementation plan:
> >
> > phase-1: IR and Function Summary Generation (-c compile)
> > phase-2: Thin Linker Plugin Layer (thin archive linker step)
> > phase-3: Parallel Backend with Demand-Driven Importing
> >
> >
> > Implementation Plan
> > ===============> >
> > This section gives a high-level breakdown of the ThinLTO support that
> > will be added, in roughly the order that the patches would be staged.
> > The patches are divided into three stages. The first stage contains a
> > minimal amount of preparation work that is not ThinLTO-specific. The
> > second stage contains most of the infrastructure for ThinLTO, which
> > will be off by default. The third stage includes
> > enhancements/improvements/tunings that can be performed after the main
> > ThinLTO infrastructure is in.
> >
> > The second and third implementation stages will initially be very
> > volatile, requiring a lot of iterations and tuning with large apps to
> > get stabilized. Therefore it will be important to do fast commits for
> > these implementation stages.
> >
> >
> > 1. Stage 1: Preparation
> > -------------------------------
> >
> > The first planned sets of patches are enablers for ThinLTO work:
> >
> >
> > a. LTO directory structure:
> >
> > Restructure the LTO directory to remove circular dependence when
> > ThinLTO pass added. Because ThinLTO is being implemented as a SCC pass
> > within Transforms/IPO, and leverages the LTOModule class for linking
> > in functions from modules, IPO then requires the LTO library. This
> > creates a circular dependence between LTO and IPO. To break that, we
> > need to split the lib/LTO directory/library into lib/LTO/CodeGen and
> > lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
> > respectively. Only LTOCodeGenerator has a dependence on IPO, removing
> > the circular dependence.
> >
> >
> > b. ELF wrapper generation support:
> >
> > Implement ELF wrapped bitcode writer. In order to more easily interact
> > with tools such as $AR, $NM, and “$LD -r” we plan to emit the phase-1
> > bitcode wrapped in ELF via the .llvmbc section, along with a symbol
> > table. The goal is both to interact with these tools without requiring
> > a plugin, and also to avoid doing partial LTO/ThinLTO across files
> > linked with “$LD -r” (i.e. the resulting object file should still
> > contain ELF-wrapped bitcode to enable ThinLTO at the full link step).
> > I will send a separate design document for these changes, but the
> > following is a high-level overview.
> >
> > Support was added to LLVM for reading ELF-wrapped bitcode
> > (http://reviews.llvm.org/rL218078), but there does not yet exist
> > support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
> > add support for optionally generating bitcode in an ELF file
> > containing a single .llvmbc section holding the bitcode. Specifically,
> > the patch would add new options “emit-llvm-bc-elf” (object file) and
> > corresponding “emit-llvm-elf” (textual assembly code equivalent).
> > Eventually these would be automatically triggered under “-fthinlto -c”
> > and “-fthinlto -S”, respectively.
> >
> > Additionally, a symbol table will be generated in the ELF file,
> > holding the function symbols within the bitcode. This facilitates
> > handling archives of the ELF-wrapped bitcode created with $AR, since
> > the archive will have a symbol table as well. The archive symbol table
> > enables gold to extract and pass to the plugin the constituent
> > ELF-wrapped bitcode files. To support the concatenated llvmbc section
> > generated by “$LD -r”, some handling needs to be added to gold and to
> > the backend driver to process each original module’s bitcode.
> >
> > The function index/summary will later be added as a special ELF
> > section alongside the .llvmbc sections.
> >
> >
> > 2. Stage 2: ThinLTO Infrastructure
> > ----------------------------------------------
> >
> > The next set of patches adds the base implementation of the ThinLTO
> > infrastructure, specifically those required to make ThinLTO functional
> > and generate correct but not necessarily high-performing binaries. It
> > also does not include support to make debug support under -g efficient
> > with ThinLTO.
> >
> >
> > a. Clang/LLVM/gold linker options:
> >
> > An early set of clang/llvm patches is needed to provide options to
> > enable ThinLTO (off by default), so that the rest of the
> > implementation can be disabled by default as it is added.
> > Specifically, clang options -fthinlto (used instead of -flto) will
> > cause clang to invoke the phase-1 emission of LLVM bitcode and
> > function summary/index on a compile step, and pass the appropriate
> > option to the gold plugin on a link step. The -thinlto option will be
> > added to the gold plugin and llvm-lto tool to launch the phase-2 thin
> > archive step. The -thinlto option will also be added to the ‘opt’ tool
> > to invoke it as a phase-3 parallel backend instance.
> >
> >
> > b. Thin-archive linking support in Gold plugin and llvm-lto:
> >
> > Under the new plugin option (see above), the plugin needs to perform
> > the phase-2 (thin archive) link which simply emits a combined function
> > map from the linked modules, without actually performing the normal
> > link. Corresponding support should be added to the standalone llvm-lto
> > tool to enable testing/debugging without involving the linker and
> > plugin.
> >
> >
> > c. ThinLTO backend support:
> >
> > Support for invoking a phase-3 backend invocation (including
> > importing) on a module should be added to the ‘opt’ tool under the new
> > option. The main change under the option is to instantiate a Linker
> > object used to manage the process of linking imported functions into
> > the module, efficient read of the combined function map, and enable
> > the ThinLTO import pass.
> >
> >
> > d. Function index/summary support:
> >
> > This includes infrastructure for writing and reading the function
> > index/summary section. As noted earlier this will be encoded in a
> > special ELF section within the module, alongside the .llvmbc section
> > containing the bitcode. The thin archive generated by phase-2 of
> > ThinLTO simply contains all of the function index/summary sections
> > across the linked modules, organized for efficient function lookup.
> >
> > Each function available for importing from the module contains an
> > entry in the module’s function index/summary section and in the
> > resulting combined function map. Each function entry contains that
> > function’s offset within the bitcode file, used to efficiently locate
> > and quickly import just that function. The entry also contains summary
> > information (e.g. basic information determined during parsing such as
> > the number of instructions in the function), that will be used to help
> > guide later import decisions. Because the contents of this section
> > will change frequently during ThinLTO tuning, it should also be marked
> > with a version id for backwards compatibility or version checking.
> >
> >
> > e. ThinLTO importing support:
> >
> > Support for the mechanics of importing functions from other modules,
> > which can go in gradually as a set of patches since it will be off by
> > default. Separate patches can include:
> >
> > - BitcodeReader changes to use function index to import/deserialize
> > single function of interest (small changes, leverages existing lazy
> > streamer support).
> >
> > - Minor LTOModule changes to pass the ThinLTO function to import and
> > its index into bitcode reader.
> >
> > - Marking of imported functions (for use in ThinLTO-specific symbol
> > linking and global DCE, for example). This can be in-memory initially,
> > but IR support may be required in order to support streaming bitcode
> > out and back in again after importing.
> >
> > - ModuleLinker changes to do ThinLTO-specific symbol linking and
> > static promotion when necessary. The linkage type of imported
> > functions changes to AvailableExternallyLinkage, for example. Statics
> > must be promoted in certain cases, and renamed in consistent ways.
> >
> > - GlobalDCE changes to support removing imported functions that were
> > not inlined (very small changes to existing pass logic).
> >
> >
> > f. ThinLTO Import Driver SCC pass:
> >
> > Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
> > an SCC pass, enabled only under -fthinlto options. The pass includes
> > utilizing the thin archive (global function index/summary), import
> > decision heuristics, invocation of LTOModule/ModuleLinker routines
> > that perform the import, and any necessary callgraph updates and
> > verification.
> >
> >
> > g. Backend Driver:
> >
> > For a single node build, the gold plugin can simply write a makefile
> > and fork the parallel backend instances directly via parallel make.
> >
> >
> > 3. Stage 3: ThinLTO Tuning and Enhancements
> > ----------------------------------------------------------------
> >
> > This refers to the patches that are not required for ThinLTO to work,
> > but rather to improve compile time, memory, run-time performance and
> > usability.
> >
> >
> > a. Lazy Debug Metadata Linking:
> >
> > The prototype implementation included lazy importing of module-level
> > metadata during the ThinLTO pass finalization (i.e. after all function
> > importing is complete). This actually applies to all module-level
> > metadata, not just debug, although it is the largest. This can be
> > added as a separate set of patches. Changes to BitcodeReader,
> > ValueMapper, ModuleLinker
> >
> >
> > b. Import Tuning:
> >
> > Tuning the import strategy will be an iterative process that will
> > continue to be refined over time. It involves several different types
> > of changes: adding support for recording additional metrics in the
> > function summary, such as profile data and optional heavier-weight IPA
> > analyses, and tuning the import heuristics based on the summary and
> > callsite context.
> >
> >
> > c. Combined Function Map Pruning:
> >
> > The combined function map can be pruned of functions that are unlikely
> > to benefit from being imported. For example, during the phase-2 thin
> > archive plug step we can safely omit large and (with profile data)
> > cold functions, which are unlikely to benefit from being inlined.
> > Additionally, all but one copy of comdat functions can be suppressed.
> >
> >
> > d. Distributed Build System Integration:
> >
> > For a distributed build system, the gold plugin should write the
> > parallel backend invocations into a makefile, including the mapping
> > from the IR file to the real object file path, and exit. Additional
> > work needs to be done in the distributed build system itself to
> > distribute and dispatch the parallel backend jobs to the build
> > cluster.
> >
> >
> > e. Dependence Tracking and Incremental Compiles:
> >
> > In order to support build systems that stage from local disks or
> > network storage, the plugin will optionally support computation of
> > dependent sets of IR files that each module may import from. This can
> > be computed from profile data, if it exists, or from the symbol table
> > and heuristics if not. These dependence sets also enable support for
> > incremental backend compiles.
> >
> >
> >
> > --
> > Teresa Johnson | Software Engineer | tejohnson at google.com |
408-460-2413
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150513/a9b3ea1a/attachment.html>

Duncan P. N. Exon Smith

2015-May-14 23:29 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

> On 2015-May-13, at 11:44, Teresa Johnson <tejohnson at google.com>
wrote:
> 
> I've included below an RFC for implementing ThinLTO in LLVM, looking
> forward to feedback and questions.
> Thanks!
> Teresa
> 
> 
> 
> RFC to discuss plans for implementing ThinLTO upstream. Background can
> be found in slides from EuroLLVM 2015:
>  
https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
> As described in the talk, we have a prototype implementation, and
> would like to start staging patches upstream. This RFC describes a
> breakdown of the major pieces. We would like to commit upstream
> gradually in several stages, with all functionality off by default.
> The core ThinLTO importing support and tuning will require frequent
> change and iteration during testing and tuning, and for that part we
> would like to commit rapidly (off by default). See the proposed staged
> implementation described in the Implementation Plan section.
> 
> 
> ThinLTO Overview
> =============> 
> See the talk slides linked above for more details. The following is a
> high-level overview of the motivation.
> 
> Cross Module Optimization (CMO) is an effective means for improving
> runtime performance, by extending the scope of optimizations across
> source module boundaries. Without CMO, the compiler is limited to
> optimizing within the scope of single source modules. Two solutions
> for enabling CMO are Link-Time Optimization (LTO), which is currently
> supported in LLVM and GCC, and Lightweight-Interprocedural
> Optimization (LIPO). However, each of these solutions has limitations
> that prevent it from being enabled by default. ThinLTO is a new
> approach that attempts to address these limitations, with a goal of
> being enabled more broadly. ThinLTO is designed with many of the same
> principals as LIPO, and therefore its advantages, without any of its
> inherent weakness. Unlike in LIPO where the module group decision is
> made at profile training runtime, ThinLTO makes the decision at
> compile time, but in a lazy mode that facilitates large scale
> parallelism. The serial linker plugin phase is designed to be razor
> thin and blazingly fast. By default this step only does minimal
> preparation work to enable the parallel lazy importing performed
> later. ThinLTO aims to be scalable like a regular O2 build, enabling
> CMO on machines without large memory configurations, while also
> integrating well with distributed build systems. Results from early
> prototyping on SPEC cpu2006 C++ benchmarks are in line with
> expectations that ThinLTO can scale like O2 while enabling much of the
> CMO performed during a full LTO build.
> 
> 
> A ThinLTO build is divided into 3 phases, which are referred to in the
> following implementation plan:
> 
> phase-1: IR and Function Summary Generation (-c compile)
> phase-2: Thin Linker Plugin Layer (thin archive linker step)
> phase-3: Parallel Backend with Demand-Driven Importing
> 
> 
> Implementation Plan
> ===============> 
> This section gives a high-level breakdown of the ThinLTO support that
> will be added, in roughly the order that the patches would be staged.
> The patches are divided into three stages. The first stage contains a
> minimal amount of preparation work that is not ThinLTO-specific. The
> second stage contains most of the infrastructure for ThinLTO, which
> will be off by default. The third stage includes
> enhancements/improvements/tunings that can be performed after the main
> ThinLTO infrastructure is in.
> 
> The second and third implementation stages will initially be very
> volatile, requiring a lot of iterations and tuning with large apps to
> get stabilized. Therefore it will be important to do fast commits for
> these implementation stages.
> 
> 
> 1. Stage 1: Preparation
> -------------------------------
> 
> The first planned sets of patches are enablers for ThinLTO work:
> 
> 
> a. LTO directory structure:
> 
> Restructure the LTO directory to remove circular dependence when
> ThinLTO pass added. Because ThinLTO is being implemented as a SCC pass
> within Transforms/IPO, and leverages the LTOModule class for linking
> in functions from modules, IPO then requires the LTO library. This
> creates a circular dependence between LTO and IPO. To break that, we
> need to split the lib/LTO directory/library into lib/LTO/CodeGen and
> lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
> respectively. Only LTOCodeGenerator has a dependence on IPO, removing
> the circular dependence.
> 
I wonder whether LTOModule is a good fit (it might be; I'm not sure).
We still use it in libLTO, but gold-plugin.cpp no longer uses it,
instead using lib/Object and lib/Linker directly.
> b. ELF wrapper generation support:
(From elsewhere in the thread, it looks like you're just using ELF
as a short-hand for "native".)
> 
> Implement ELF wrapped bitcode writer. In order to more easily interact
> with tools such as $AR, $NM, and “$LD -r” we plan to emit the phase-1
> bitcode wrapped in ELF via the .llvmbc section, along with a symbol
> table. The goal is both to interact with these tools without requiring
> a plugin, and also to avoid doing partial LTO/ThinLTO across files
> linked with “$LD -r” (i.e. the resulting object file should still
> contain ELF-wrapped bitcode to enable ThinLTO at the full link step).
Shouldn't `ld -r` change symbol visibility and such?  How do you plan
to handle that when you concatenate sections?

For reference, ld64 (through libLTO) merges all the bitcode together
with lib/Linker, gives all "hidden" symbols local linkage (by running
-internalize with OnlyHidden=1), and writes out a new bitcode file.
> I will send a separate design document for these changes, but the
> following is a high-level overview.
> 
> Support was added to LLVM for reading ELF-wrapped bitcode
> (http://reviews.llvm.org/rL218078), but there does not yet exist
> support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
> add support for optionally generating bitcode in an ELF file
> containing a single .llvmbc section holding the bitcode. Specifically,
> the patch would add new options “emit-llvm-bc-elf” (object file) and
> corresponding “emit-llvm-elf” (textual assembly code equivalent).
If we decide to go this way -- wrapping the bitcode in the native
object format -- wouldn't emit-llvm-native or emit-llvm-object be
better?  The native object format is implied by the triple.
> Eventually these would be automatically triggered under “-fthinlto -c”
> and “-fthinlto -S”, respectively.
> 
> Additionally, a symbol table will be generated in the ELF file,
> holding the function symbols within the bitcode. This facilitates
> handling archives of the ELF-wrapped bitcode created with $AR, since
> the archive will have a symbol table as well. The archive symbol table
> enables gold to extract and pass to the plugin the constituent
> ELF-wrapped bitcode files. To support the concatenated llvmbc section
> generated by “$LD -r”, some handling needs to be added to gold and to
> the backend driver to process each original module’s bitcode.
> 
> The function index/summary will later be added as a special ELF
> section alongside the .llvmbc sections.
> 
> 
> 2. Stage 2: ThinLTO Infrastructure
> ----------------------------------------------
> 
> The next set of patches adds the base implementation of the ThinLTO
> infrastructure, specifically those required to make ThinLTO functional
> and generate correct but not necessarily high-performing binaries. It
> also does not include support to make debug support under -g efficient
> with ThinLTO.
I think we should at least have a vague plan...
> a. Clang/LLVM/gold linker options:
> 
> An early set of clang/llvm patches is needed to provide options to
> enable ThinLTO (off by default), so that the rest of the
> implementation can be disabled by default as it is added.
> Specifically, clang options -fthinlto (used instead of -flto) will
> cause clang to invoke the phase-1 emission of LLVM bitcode and
> function summary/index on a compile step, and pass the appropriate
> option to the gold plugin on a link step. The -thinlto option will be
> added to the gold plugin and llvm-lto tool to launch the phase-2 thin
> archive step. The -thinlto option will also be added to the ‘opt’ tool
> to invoke it as a phase-3 parallel backend instance.
I'm not sure I follow the `opt` part of this.  That's a developer
tool, not something we ship.  It also doesn't have a backend (doesn't
do CodeGen).  What am I missing?
> b. Thin-archive linking support in Gold plugin and llvm-lto:
> 
> Under the new plugin option (see above), the plugin needs to perform
> the phase-2 (thin archive) link which simply emits a combined function
> map from the linked modules, without actually performing the normal
> link. Corresponding support should be added to the standalone llvm-lto
> tool to enable testing/debugging without involving the linker and
> plugin.
> 
> 
> c. ThinLTO backend support:
> 
> Support for invoking a phase-3 backend invocation (including
> importing) on a module should be added to the ‘opt’ tool under the new
> option. The main change under the option is to instantiate a Linker
> object used to manage the process of linking imported functions into
> the module, efficient read of the combined function map, and enable
> the ThinLTO import pass.
> 
> 
> d. Function index/summary support:
> 
> This includes infrastructure for writing and reading the function
> index/summary section. As noted earlier this will be encoded in a
> special ELF section within the module, alongside the .llvmbc section
> containing the bitcode. The thin archive generated by phase-2 of
> ThinLTO simply contains all of the function index/summary sections
> across the linked modules, organized for efficient function lookup.
> 
> Each function available for importing from the module contains an
> entry in the module’s function index/summary section and in the
> resulting combined function map. Each function entry contains that
> function’s offset within the bitcode file, used to efficiently locate
> and quickly import just that function.
I don't think you'll actually buy anything here over the lazy-loading
feature in the BitcodeReader (although perhaps you can help improve
it if you have some ideas).  In practice, to correctly load a
Function you need to load constants (include declarations for other
GlobalValues) and metadata that it references.
> The entry also contains summary
> information (e.g. basic information determined during parsing such as
> the number of instructions in the function), that will be used to help
> guide later import decisions. Because the contents of this section
> will change frequently during ThinLTO tuning, it should also be marked
> with a version id for backwards compatibility or version checking.
> 
> 
> e. ThinLTO importing support:
> 
> Support for the mechanics of importing functions from other modules,
> which can go in gradually as a set of patches since it will be off by
> default. Separate patches can include:
> 
> - BitcodeReader changes to use function index to import/deserialize
> single function of interest (small changes, leverages existing lazy
> streamer support).
Ah, here it is.  Should have read ahead.

How do you plan to handle references to other GlobalValues (global
variables, functions, and aliases)?  If you're going to keep loading
the symbol table (which I think you need to?), then the lazy loader
already creates a function index.  Or do you have some other plan?

If an imported function references functions with internal linkage,
will you pull in copies of those functions as well?

If an imported function references global variables with internal
linkage... actually, that doesn't seem legal.  Will you disallow
importing such functions?  How will you mark them?
> 
> - Minor LTOModule changes to pass the ThinLTO function to import and
> its index into bitcode reader.
> 
> - Marking of imported functions (for use in ThinLTO-specific symbol
> linking and global DCE, for example).
Marking how?  Do you mean giving them internal linkage, or something
else?

What's your plan for ThinLTO-specific symbol linking?
> This can be in-memory initially,
> but IR support may be required in order to support streaming bitcode
> out and back in again after importing.
> 
> - ModuleLinker changes to do ThinLTO-specific symbol linking and
> static promotion when necessary. The linkage type of imported
> functions changes to AvailableExternallyLinkage, for example. Statics
> must be promoted in certain cases, and renamed in consistent ways.
Ah, could have read ahead again; this answers my questions about
referencing global variables with local linkage.

It also sounds pretty hairy.  Details welcome.
> 
> - GlobalDCE changes to support removing imported functions that were
> not inlined (very small changes to existing pass logic).
If you give them "available_externally" linkage, won't this
already
happen?
> 
> 
> f. ThinLTO Import Driver SCC pass:
> 
> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
> an SCC pass, enabled only under -fthinlto options. The pass includes
> utilizing the thin archive (global function index/summary), import
> decision heuristics, invocation of LTOModule/ModuleLinker routines
> that perform the import, and any necessary callgraph updates and
> verification.
> 
> 
> g. Backend Driver:
> 
> For a single node build, the gold plugin can simply write a makefile
> and fork the parallel backend instances directly via parallel make.
This doesn't seem like the way we'd want to test this, and it
seems strange for the toolchain to require a build system...
> 
> 
> 3. Stage 3: ThinLTO Tuning and Enhancements
> ----------------------------------------------------------------
> 
> This refers to the patches that are not required for ThinLTO to work,
> but rather to improve compile time, memory, run-time performance and
> usability.
> 
> 
> a. Lazy Debug Metadata Linking:
> 
> The prototype implementation included lazy importing of module-level
> metadata during the ThinLTO pass finalization (i.e. after all function
> importing is complete). This actually applies to all module-level
> metadata, not just debug, although it is the largest. This can be
> added as a separate set of patches. Changes to BitcodeReader,
> ValueMapper, ModuleLinker
It sounds like this would work well with the "full" LTO implemented
by tools/gold-plugin right now.  What exactly did you do to improve
this?
> 
> 
> b. Import Tuning:
> 
> Tuning the import strategy will be an iterative process that will
> continue to be refined over time. It involves several different types
> of changes: adding support for recording additional metrics in the
> function summary, such as profile data and optional heavier-weight IPA
> analyses, and tuning the import heuristics based on the summary and
> callsite context.
> 
> 
> c. Combined Function Map Pruning:
> 
> The combined function map can be pruned of functions that are unlikely
> to benefit from being imported. For example, during the phase-2 thin
> archive plug step we can safely omit large and (with profile data)
> cold functions, which are unlikely to benefit from being inlined.
> Additionally, all but one copy of comdat functions can be suppressed.
> 
> 
> d. Distributed Build System Integration:
> 
> For a distributed build system, the gold plugin should write the
> parallel backend invocations into a makefile, including the mapping
> from the IR file to the real object file path, and exit. Additional
> work needs to be done in the distributed build system itself to
> distribute and dispatch the parallel backend jobs to the build
> cluster.
> 
> 
> e. Dependence Tracking and Incremental Compiles:
> 
> In order to support build systems that stage from local disks or
> network storage, the plugin will optionally support computation of
> dependent sets of IR files that each module may import from. This can
> be computed from profile data, if it exists, or from the symbol table
> and heuristics if not. These dependence sets also enable support for
> incremental backend compiles.
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Teresa Johnson

2015-May-15 14:30 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

Thanks for all the feedback and questions, answers below.
Teresa

On Thu, May 14, 2015 at 4:29 PM, Duncan P. N. Exon Smith
<dexonsmith at apple.com> wrote:>
>> On 2015-May-13, at 11:44, Teresa Johnson <tejohnson at
google.com> wrote:
>>
>> I've included below an RFC for implementing ThinLTO in LLVM,
looking
>> forward to feedback and questions.
>> Thanks!
>> Teresa
>>
>>
>>
>> RFC to discuss plans for implementing ThinLTO upstream. Background can
>> be found in slides from EuroLLVM 2015:
>>  
https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
>> As described in the talk, we have a prototype implementation, and
>> would like to start staging patches upstream. This RFC describes a
>> breakdown of the major pieces. We would like to commit upstream
>> gradually in several stages, with all functionality off by default.
>> The core ThinLTO importing support and tuning will require frequent
>> change and iteration during testing and tuning, and for that part we
>> would like to commit rapidly (off by default). See the proposed staged
>> implementation described in the Implementation Plan section.
>>
>>
>> ThinLTO Overview
>> =============>>
>> See the talk slides linked above for more details. The following is a
>> high-level overview of the motivation.
>>
>> Cross Module Optimization (CMO) is an effective means for improving
>> runtime performance, by extending the scope of optimizations across
>> source module boundaries. Without CMO, the compiler is limited to
>> optimizing within the scope of single source modules. Two solutions
>> for enabling CMO are Link-Time Optimization (LTO), which is currently
>> supported in LLVM and GCC, and Lightweight-Interprocedural
>> Optimization (LIPO). However, each of these solutions has limitations
>> that prevent it from being enabled by default. ThinLTO is a new
>> approach that attempts to address these limitations, with a goal of
>> being enabled more broadly. ThinLTO is designed with many of the same
>> principals as LIPO, and therefore its advantages, without any of its
>> inherent weakness. Unlike in LIPO where the module group decision is
>> made at profile training runtime, ThinLTO makes the decision at
>> compile time, but in a lazy mode that facilitates large scale
>> parallelism. The serial linker plugin phase is designed to be razor
>> thin and blazingly fast. By default this step only does minimal
>> preparation work to enable the parallel lazy importing performed
>> later. ThinLTO aims to be scalable like a regular O2 build, enabling
>> CMO on machines without large memory configurations, while also
>> integrating well with distributed build systems. Results from early
>> prototyping on SPEC cpu2006 C++ benchmarks are in line with
>> expectations that ThinLTO can scale like O2 while enabling much of the
>> CMO performed during a full LTO build.
>>
>>
>> A ThinLTO build is divided into 3 phases, which are referred to in the
>> following implementation plan:
>>
>> phase-1: IR and Function Summary Generation (-c compile)
>> phase-2: Thin Linker Plugin Layer (thin archive linker step)
>> phase-3: Parallel Backend with Demand-Driven Importing
>>
>>
>> Implementation Plan
>> ===============>>
>> This section gives a high-level breakdown of the ThinLTO support that
>> will be added, in roughly the order that the patches would be staged.
>> The patches are divided into three stages. The first stage contains a
>> minimal amount of preparation work that is not ThinLTO-specific. The
>> second stage contains most of the infrastructure for ThinLTO, which
>> will be off by default. The third stage includes
>> enhancements/improvements/tunings that can be performed after the main
>> ThinLTO infrastructure is in.
>>
>> The second and third implementation stages will initially be very
>> volatile, requiring a lot of iterations and tuning with large apps to
>> get stabilized. Therefore it will be important to do fast commits for
>> these implementation stages.
>>
>>
>> 1. Stage 1: Preparation
>> -------------------------------
>>
>> The first planned sets of patches are enablers for ThinLTO work:
>>
>>
>> a. LTO directory structure:
>>
>> Restructure the LTO directory to remove circular dependence when
>> ThinLTO pass added. Because ThinLTO is being implemented as a SCC pass
>> within Transforms/IPO, and leverages the LTOModule class for linking
>> in functions from modules, IPO then requires the LTO library. This
>> creates a circular dependence between LTO and IPO. To break that, we
>> need to split the lib/LTO directory/library into lib/LTO/CodeGen and
>> lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
>> respectively. Only LTOCodeGenerator has a dependence on IPO, removing
>> the circular dependence.
>>
>
> I wonder whether LTOModule is a good fit (it might be; I'm not sure).
> We still use it in libLTO, but gold-plugin.cpp no longer uses it,
> instead using lib/Object and lib/Linker directly.
>
>> b. ELF wrapper generation support:
>
> (From elsewhere in the thread, it looks like you're just using ELF
> as a short-hand for "native".)
Right, I should have written this as native object wrapper. I had
focused on ELF since that was what I have been looking at most
closely, but the support can be more general.
>
>>
>> Implement ELF wrapped bitcode writer. In order to more easily interact
>> with tools such as $AR, $NM, and “$LD -r” we plan to emit the phase-1
>> bitcode wrapped in ELF via the .llvmbc section, along with a symbol
>> table. The goal is both to interact with these tools without requiring
>> a plugin, and also to avoid doing partial LTO/ThinLTO across files
>> linked with “$LD -r” (i.e. the resulting object file should still
>> contain ELF-wrapped bitcode to enable ThinLTO at the full link step).
>
> Shouldn't `ld -r` change symbol visibility and such?  How do you plan
> to handle that when you concatenate sections?
If we use native object wrapped bitcode, ld -r would not do any
changing of symbols or merging. It would be more like an archive in
that it packages the bitcode and delays merging until the backend.
That way it's constituents are still bitcode available for importing
into other modules.

For the non-wrapped bitcode option, using the gold plugin, we would
want to change the behavior for ld -r to be similar to what you are
describing for ld64, i.e. emit bitcode.
>
> For reference, ld64 (through libLTO) merges all the bitcode together
> with lib/Linker, gives all "hidden" symbols local linkage (by
running
> -internalize with OnlyHidden=1), and writes out a new bitcode file.
>
>> I will send a separate design document for these changes, but the
>> following is a high-level overview.
>>
>> Support was added to LLVM for reading ELF-wrapped bitcode
>> (http://reviews.llvm.org/rL218078), but there does not yet exist
>> support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
>> add support for optionally generating bitcode in an ELF file
>> containing a single .llvmbc section holding the bitcode. Specifically,
>> the patch would add new options “emit-llvm-bc-elf” (object file) and
>> corresponding “emit-llvm-elf” (textual assembly code equivalent).
>
> If we decide to go this way -- wrapping the bitcode in the native
> object format -- wouldn't emit-llvm-native or emit-llvm-object be
> better?  The native object format is implied by the triple.
Yes, that is better.
>
>> Eventually these would be automatically triggered under “-fthinlto -c”
>> and “-fthinlto -S”, respectively.
>>
>> Additionally, a symbol table will be generated in the ELF file,
>> holding the function symbols within the bitcode. This facilitates
>> handling archives of the ELF-wrapped bitcode created with $AR, since
>> the archive will have a symbol table as well. The archive symbol table
>> enables gold to extract and pass to the plugin the constituent
>> ELF-wrapped bitcode files. To support the concatenated llvmbc section
>> generated by “$LD -r”, some handling needs to be added to gold and to
>> the backend driver to process each original module’s bitcode.
>>
>> The function index/summary will later be added as a special ELF
>> section alongside the .llvmbc sections.
>>
>>
>> 2. Stage 2: ThinLTO Infrastructure
>> ----------------------------------------------
>>
>> The next set of patches adds the base implementation of the ThinLTO
>> infrastructure, specifically those required to make ThinLTO functional
>> and generate correct but not necessarily high-performing binaries. It
>> also does not include support to make debug support under -g efficient
>> with ThinLTO.
>
> I think we should at least have a vague plan...
Sorry, I should have been clearer here. I do have a plan for this and
know how to do it (it is implemented in my prototype). It's discussed
below under Stage 3. I was debating whether to put the metadata
handling under Stage 2, but it isn't strictly necessary to get the
ThinLTO pipeline working. You just end up with a lot of duplicate
metadata/debug as you have to import it multiple times. But really the
metadata (incl debug) handling should be the next thing after the
basic ThinLTO pipeline is done.
>
>> a. Clang/LLVM/gold linker options:
>>
>> An early set of clang/llvm patches is needed to provide options to
>> enable ThinLTO (off by default), so that the rest of the
>> implementation can be disabled by default as it is added.
>> Specifically, clang options -fthinlto (used instead of -flto) will
>> cause clang to invoke the phase-1 emission of LLVM bitcode and
>> function summary/index on a compile step, and pass the appropriate
>> option to the gold plugin on a link step. The -thinlto option will be
>> added to the gold plugin and llvm-lto tool to launch the phase-2 thin
>> archive step. The -thinlto option will also be added to the ‘opt’ tool
>> to invoke it as a phase-3 parallel backend instance.
>
> I'm not sure I follow the `opt` part of this.  That's a developer
> tool, not something we ship.  It also doesn't have a backend
(doesn't
> do CodeGen).  What am I missing?
For the prototype I was using llvm-lto as my backend driver. I
realized that this was probably not the best option as we don't need
all of the LTO handling built into that driver, and it isn't listed as
a tool on http://llvm.org/docs/CommandGuide/, so my feeling was that
'opt' was better supported and a better alternative. Unfortunately
when I was writing this up I forgot that 'opt' generates bitcode not
an object file.

Another option would be to use clang and allow it to accept bitcode
and bypass parsing under an appropriate ThinLTO option. AFAICT there
isn't currently an option for clang to accept bitcode. Do you think
this is the right approach?

>
>> b. Thin-archive linking support in Gold plugin and llvm-lto:
>>
>> Under the new plugin option (see above), the plugin needs to perform
>> the phase-2 (thin archive) link which simply emits a combined function
>> map from the linked modules, without actually performing the normal
>> link. Corresponding support should be added to the standalone llvm-lto
>> tool to enable testing/debugging without involving the linker and
>> plugin.
>>
>>
>> c. ThinLTO backend support:
>>
>> Support for invoking a phase-3 backend invocation (including
>> importing) on a module should be added to the ‘opt’ tool under the new
>> option. The main change under the option is to instantiate a Linker
>> object used to manage the process of linking imported functions into
>> the module, efficient read of the combined function map, and enable
>> the ThinLTO import pass.
>>
>>
>> d. Function index/summary support:
>>
>> This includes infrastructure for writing and reading the function
>> index/summary section. As noted earlier this will be encoded in a
>> special ELF section within the module, alongside the .llvmbc section
>> containing the bitcode. The thin archive generated by phase-2 of
>> ThinLTO simply contains all of the function index/summary sections
>> across the linked modules, organized for efficient function lookup.
>>
>> Each function available for importing from the module contains an
>> entry in the module’s function index/summary section and in the
>> resulting combined function map. Each function entry contains that
>> function’s offset within the bitcode file, used to efficiently locate
>> and quickly import just that function.
>
> I don't think you'll actually buy anything here over the
lazy-loading
> feature in the BitcodeReader (although perhaps you can help improve
> it if you have some ideas).  In practice, to correctly load a
> Function you need to load constants (include declarations for other
> GlobalValues) and metadata that it references.
As you saw below, it is leveraging the lazy loading support. The
metadata handling is discussed later on in 3a.
>
>> The entry also contains summary
>> information (e.g. basic information determined during parsing such as
>> the number of instructions in the function), that will be used to help
>> guide later import decisions. Because the contents of this section
>> will change frequently during ThinLTO tuning, it should also be marked
>> with a version id for backwards compatibility or version checking.
>>
>>
>> e. ThinLTO importing support:
>>
>> Support for the mechanics of importing functions from other modules,
>> which can go in gradually as a set of patches since it will be off by
>> default. Separate patches can include:
>>
>> - BitcodeReader changes to use function index to import/deserialize
>> single function of interest (small changes, leverages existing lazy
>> streamer support).
>
> Ah, here it is.  Should have read ahead.
>
> How do you plan to handle references to other GlobalValues (global
> variables, functions, and aliases)? If you're going to keep loading
> the symbol table (which I think you need to?), then the lazy loader
> already creates a function index.  Or do you have some other plan?
We do have to reload the declarations and other symbol table info.
Where it differs from the lazy loader is that we don't need to keep
parsing the module to build up the function index
(DeferredFunctionInfo), with repeated calls to
FindFunctionInStream/ParseModule. Once we hit the first function body
we stop, then when materializing we simply set up the
DeferredFunctionInfo entry from the bitcode index that was saved in
the ThinLTO function index.
>
> If an imported function references functions with internal linkage,
> will you pull in copies of those functions as well?
There are two possibilities in this case: promotion (along with
renaming to avoid name clashing with other modules), or force import.
As you note later on, I talk about promotion just below here. To limit
the required static promotions I have implemented a strategy where we
attempt to force import referenced functions that have internal
linkage. But we still must do static promotion if the local function
(or global) is potentially imported to another module (in the combined
function map) and is address exposed.
>
> If an imported function references global variables with internal
> linkage... actually, that doesn't seem legal.  Will you disallow
> importing such functions?  How will you mark them?
Static promotion handles this.
>
>>
>> - Minor LTOModule changes to pass the ThinLTO function to import and
>> its index into bitcode reader.
>>
>> - Marking of imported functions (for use in ThinLTO-specific symbol
>> linking and global DCE, for example).
>
> Marking how?  Do you mean giving them internal linkage, or something
> else?
Mentioned just after this: either an in-memory flag on the Function
class, or potentially in the IR. For the prototype I just had a flag
on the Function class.
>
> What's your plan for ThinLTO-specific symbol linking?
Mentioned just below here as you note.
>
>> This can be in-memory initially,
>> but IR support may be required in order to support streaming bitcode
>> out and back in again after importing.
>>
>> - ModuleLinker changes to do ThinLTO-specific symbol linking and
>> static promotion when necessary. The linkage type of imported
>> functions changes to AvailableExternallyLinkage, for example. Statics
>> must be promoted in certain cases, and renamed in consistent ways.
>
> Ah, could have read ahead again; this answers my questions about
> referencing global variables with local linkage.
>
> It also sounds pretty hairy.  Details welcome.
It has to be well thought out for sure. We had to do this for LIPO as
well so already knew what needed to be done here. I will put together
more details in a follow-on email.
>
>>
>> - GlobalDCE changes to support removing imported functions that were
>> not inlined (very small changes to existing pass logic).
>
> If you give them "available_externally" linkage, won't this
already
> happen?
There were only a couple minor tweaks required here (under the flag I
added to the Function indicating that this was imported). Only
promoted statics are remarked available_externally. For a
non-discardable symbol that was imported, we can discard here since we
are done with inlining (it is non-discardable in its home module).
>
>>
>>
>> f. ThinLTO Import Driver SCC pass:
>>
>> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
>> an SCC pass, enabled only under -fthinlto options. The pass includes
>> utilizing the thin archive (global function index/summary), import
>> decision heuristics, invocation of LTOModule/ModuleLinker routines
>> that perform the import, and any necessary callgraph updates and
>> verification.
>>
>>
>> g. Backend Driver:
>>
>> For a single node build, the gold plugin can simply write a makefile
>> and fork the parallel backend instances directly via parallel make.
>
> This doesn't seem like the way we'd want to test this, and it
> seems strange for the toolchain to require a build system...
The idea is to make this all transparent to the user. So you can just
do something like:
% clang -fthinlto -O2 *.cc -c
% clang -fthinlto -O2 *.o

the second command will do everything transparently (phase-2 thin
plugin later, launch parallel backend processes, hand back resulting
native object code to linker, produce a.out). So somehow the plugin
needs to launch the parallel backend processes.
>
>>
>>
>> 3. Stage 3: ThinLTO Tuning and Enhancements
>> ----------------------------------------------------------------
>>
>> This refers to the patches that are not required for ThinLTO to work,
>> but rather to improve compile time, memory, run-time performance and
>> usability.
>>
>>
>> a. Lazy Debug Metadata Linking:
>>
>> The prototype implementation included lazy importing of module-level
>> metadata during the ThinLTO pass finalization (i.e. after all function
>> importing is complete). This actually applies to all module-level
>> metadata, not just debug, although it is the largest. This can be
>> added as a separate set of patches. Changes to BitcodeReader,
>> ValueMapper, ModuleLinker
>
> It sounds like this would work well with the "full" LTO
implemented
> by tools/gold-plugin right now.  What exactly did you do to improve
> this?
I don't think it will help with full LTO. The parsing of the metadata
is only delayed until the ThinLTO pass finalization, and the delayed
metadata import is necessary to avoid reading and linking in the
metadata multiple times (for each function imported from that module).
Coming out of the ThinLTO pass you still have all the metadata
necessary for each function that was imported. For a full LTO that
would end up being all of the metadata in the module.

The high level summary is that during the initial import it leaves the
temporary metadata on the instructions that were imported, but saves
the index used by the bitcode reader used to correlate with the
metadata when it is ready (i.e. the MDValuePtrs index), and skips the
metadata parsing. During finalization we parse just the metadata, and
suture it up during metadata value mapping using the saved index.
>
>>
>>
>> b. Import Tuning:
>>
>> Tuning the import strategy will be an iterative process that will
>> continue to be refined over time. It involves several different types
>> of changes: adding support for recording additional metrics in the
>> function summary, such as profile data and optional heavier-weight IPA
>> analyses, and tuning the import heuristics based on the summary and
>> callsite context.
>>
>>
>> c. Combined Function Map Pruning:
>>
>> The combined function map can be pruned of functions that are unlikely
>> to benefit from being imported. For example, during the phase-2 thin
>> archive plug step we can safely omit large and (with profile data)
>> cold functions, which are unlikely to benefit from being inlined.
>> Additionally, all but one copy of comdat functions can be suppressed.
>>
>>
>> d. Distributed Build System Integration:
>>
>> For a distributed build system, the gold plugin should write the
>> parallel backend invocations into a makefile, including the mapping
>> from the IR file to the real object file path, and exit. Additional
>> work needs to be done in the distributed build system itself to
>> distribute and dispatch the parallel backend jobs to the build
>> cluster.
>>
>>
>> e. Dependence Tracking and Incremental Compiles:
>>
>> In order to support build systems that stage from local disks or
>> network storage, the plugin will optionally support computation of
>> dependent sets of IR files that each module may import from. This can
>> be computed from profile data, if it exists, or from the symbol table
>> and heuristics if not. These dependence sets also enable support for
>> incremental backend compiles.
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson at google.com |
408-460-2413
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>


-- 
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413

Eric Christopher

2015-May-15 20:42 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

Hi Teresa,

Very excited to see this work progressing :)

> The second and third implementation stages will initially be very
> volatile, requiring a lot of iterations and tuning with large apps to
> get stabilized. Therefore it will be important to do fast commits for
> these implementation stages.
>
>This sounds interesting. Could use some more description of what you think
is going to be needed here.

>
> 2. Stage 2: ThinLTO Infrastructure
> ----------------------------------------------
>
> The next set of patches adds the base implementation of the ThinLTO
> infrastructure, specifically those required to make ThinLTO functional
> and generate correct but not necessarily high-performing binaries. It
> also does not include support to make debug support under -g efficient
> with ThinLTO.
>
>This is probably something we should give some more thought to up front.
People will definitely want to be able to at least get decent back traces
out of their code (functions, file/line/col, arguments maybe) and leaving
this as an afterthought could cause more efficiency problems down the road.

>
> a. Clang/LLVM/gold linker options:
>
> An early set of clang/llvm patches is needed to provide options to
> enable ThinLTO (off by default), so that the rest of the
> implementation can be disabled by default as it is added.
> Specifically, clang options -fthinlto (used instead of -flto) will
> cause clang to invoke the phase-1 emission of LLVM bitcode and
> function summary/index on a compile step, and pass the appropriate
> option to the gold plugin on a link step. The -thinlto option will be
> added to the gold plugin and llvm-lto tool to launch the phase-2 thin
> archive step. The -thinlto option will also be added to the ‘opt’ tool
> to invoke it as a phase-3 parallel backend instance.
>
>
> b. Thin-archive linking support in Gold plugin and llvm-lto:
>
> Under the new plugin option (see above), the plugin needs to perform
> the phase-2 (thin archive) link which simply emits a combined function
> map from the linked modules, without actually performing the normal
> link. Corresponding support should be added to the standalone llvm-lto
> tool to enable testing/debugging without involving the linker and
> plugin.
>
>Have you described thin archives anywhere? I might have missed it, but I'm
curious how you see this working.

>
> c. ThinLTO backend support:
>
> Support for invoking a phase-3 backend invocation (including
> importing) on a module should be added to the ‘opt’ tool under the new
> option. The main change under the option is to instantiate a Linker
> object used to manage the process of linking imported functions into
> the module, efficient read of the combined function map, and enable
> the ThinLTO import pass.
>
In general the phases that you have here sound interesting, but I'm not
sure that I've seen the background describing them? Can you describe this
sort of change here in more detail?

> Each function available for importing from the module contains an
> entry in the module’s function index/summary section and in the
> resulting combined function map. Each function entry contains that
> function’s offset within the bitcode file, used to efficiently locate
> and quickly import just that function. The entry also contains summary
> information (e.g. basic information determined during parsing such as
> the number of instructions in the function), that will be used to help
> guide later import decisions. Because the contents of this section
> will change frequently during ThinLTO tuning, it should also be marked
> with a version id for backwards compatibility or version checking.
>
><Insert bike shed discussion of formatting, versioning, etc>

>
> e. ThinLTO importing support:
>
> Support for the mechanics of importing functions from other modules,
> which can go in gradually as a set of patches since it will be off by
> default. Separate patches can include:
>
> - BitcodeReader changes to use function index to import/deserialize
> single function of interest (small changes, leverages existing lazy
> streamer support).
>
>Sounds like this is trying to optimize the O(n) (effectively) module scan
with an AoT computation of offset in a file. Perhaps it might be worth
adding such a functionality into the module itself anyhow?

> - Marking of imported functions (for use in ThinLTO-specific symbol
> linking and global DCE, for example). This can be in-memory initially,
> but IR support may be required in order to support streaming bitcode
> out and back in again after importing.
>
>How is this different from the existing linkage facilities?

> - ModuleLinker changes to do ThinLTO-specific symbol linking and
> static promotion when necessary. The linkage type of imported
> functions changes to AvailableExternallyLinkage, for example. Statics
> must be promoted in certain cases, and renamed in consistent ways.
>
>Ditto.

> - GlobalDCE changes to support removing imported functions that were
> not inlined (very small changes to existing pass logic).
>
>Ditto.

(I think I've seen some discussion here already, if I should go and read
those threads just feel free to say that :)

>
> f. ThinLTO Import Driver SCC pass:
>
> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
> an SCC pass, enabled only under -fthinlto options. The pass includes
> utilizing the thin archive (global function index/summary), import
> decision heuristics, invocation of LTOModule/ModuleLinker routines
> that perform the import, and any necessary callgraph updates and
> verification.
>
>Would it be worth instead of trying to hook some of this in to clang/opt
but have a separate driver to prototype this up? This way the functionality
and the driver could be separate from the rest of the optimization pipeline
as well as making it (I'd hope) be more testable.

We could also use that as a way to test the decision making etc ala some of
the -### stuff out of clang or -debug output. (This description is a bit of
a stretch, but hopefully my point gets across).

> 3. Stage 3: ThinLTO Tuning and Enhancements
> ----------------------------------------------------------------
>
> This refers to the patches that are not required for ThinLTO to work,
> but rather to improve compile time, memory, run-time performance and
> usability.
>
>
> a. Lazy Debug Metadata Linking:
>
> The prototype implementation included lazy importing of module-level
> metadata during the ThinLTO pass finalization (i.e. after all function
> importing is complete). This actually applies to all module-level
> metadata, not just debug, although it is the largest. This can be
> added as a separate set of patches. Changes to BitcodeReader,
> ValueMapper, ModuleLinker
>
Can you describe more of what you've done here? We're trying to optimize
a
lot of these areas for normal LTO as well.

> b. Import Tuning:
>
> Tuning the import strategy will be an iterative process that will
> continue to be refined over time. It involves several different types
> of changes: adding support for recording additional metrics in the
> function summary, such as profile data and optional heavier-weight IPA
> analyses, and tuning the import heuristics based on the summary and
> callsite context.
>
>How is this different from the existing profile work that Diego has been
doing? I.e. how are the formats etc going to communicate?

>
> c. Combined Function Map Pruning:
>
> The combined function map can be pruned of functions that are unlikely
> to benefit from being imported. For example, during the phase-2 thin
> archive plug step we can safely omit large and (with profile data)
> cold functions, which are unlikely to benefit from being inlined.
> Additionally, all but one copy of comdat functions can be suppressed.
>
>The comdat function bit will happen with module linking, but perhaps an
idea would be to make a first pass over the code and:

a) create a new module
b) move cold functions inside while leaving declarations behind
c) migrate comdat functions the same sort of way (though perhaps not out of
line)

One random thought is that you'll need to work on the internalize pass to
handle the distributed information you have.

>
> d. Distributed Build System Integration:
>
> For a distributed build system, the gold plugin should write the
> parallel backend invocations into a makefile, including the mapping
> from the IR file to the real object file path, and exit. Additional
> work needs to be done in the distributed build system itself to
> distribute and dispatch the parallel backend jobs to the build
> cluster.
>
>Hmm? I'd love to see you elaborate here, but it's probably just far
enough
in the future that we can hit that when we get there.

>
> e. Dependence Tracking and Incremental Compiles:
>
> In order to support build systems that stage from local disks or
> network storage, the plugin will optionally support computation of
> dependent sets of IR files that each module may import from. This can
> be computed from profile data, if it exists, or from the symbol table
> and heuristics if not. These dependence sets also enable support for
> incremental backend compiles.
>
>
>Ditto.

-eric

>
> --
> Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150515/7b487398/attachment.html>

Xinliang David Li

2015-May-15 20:55 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

>
> b. Import Tuning:
>>
>> Tuning the import strategy will be an iterative process that will
>> continue to be refined over time. It involves several different types
>> of changes: adding support for recording additional metrics in the
>> function summary, such as profile data and optional heavier-weight IPA
>> analyses, and tuning the import heuristics based on the summary and
>> callsite context.
>>
>>
> How is this different from the existing profile work that Diego has been
> doing? I.e. how are the formats etc going to communicate?
>
>
ThinLTO summary creation will just be anthor consumer of profile data
producer (Instr, AutoFDO) etc.

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150515/cb341497/attachment.html>

Teresa Johnson

2015-May-16 06:11 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

On Thu, May 14, 2015 at 4:29 PM, Duncan P. N. Exon Smith
<dexonsmith at apple.com> wrote:>
>> On 2015-May-13, at 11:44, Teresa Johnson <tejohnson at
google.com> wrote:
>>
>> a. LTO directory structure:
>>
>> Restructure the LTO directory to remove circular dependence when
>> ThinLTO pass added. Because ThinLTO is being implemented as a SCC pass
>> within Transforms/IPO, and leverages the LTOModule class for linking
>> in functions from modules, IPO then requires the LTO library. This
>> creates a circular dependence between LTO and IPO. To break that, we
>> need to split the lib/LTO directory/library into lib/LTO/CodeGen and
>> lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
>> respectively. Only LTOCodeGenerator has a dependence on IPO, removing
>> the circular dependence.
>>
>
> I wonder whether LTOModule is a good fit (it might be; I'm not sure).
> We still use it in libLTO, but gold-plugin.cpp no longer uses it,
> instead using lib/Object and lib/Linker directly.
>
Forgot to answer this one. I noticed that it was used in some paths
(like by llvm-lto) but not by gold which invokes the lower level
routines more directly. It was convenient to use LTOModule to
encapsulate this, but is there a deliberate movement away from it?

-- 
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413

Teresa Johnson

2015-May-16 07:07 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

I wasn't able to write up responses to all of your questions yet, but
a few answers below.
Thanks,
Teresa

On Fri, May 15, 2015 at 1:42 PM, Eric Christopher <echristo at gmail.com>
wrote:> Hi Teresa,
>
> Very excited to see this work progressing :)
Thanks!
>
>>
>> The second and third implementation stages will initially be very
>> volatile, requiring a lot of iterations and tuning with large apps to
>> get stabilized. Therefore it will be important to do fast commits for
>> these implementation stages.
>>
>
> This sounds interesting. Could use some more description of what you think
> is going to be needed here.
>
>>
>>
>> 2. Stage 2: ThinLTO Infrastructure
>> ----------------------------------------------
>>
>> The next set of patches adds the base implementation of the ThinLTO
>> infrastructure, specifically those required to make ThinLTO functional
>> and generate correct but not necessarily high-performing binaries. It
>> also does not include support to make debug support under -g efficient
>> with ThinLTO.
>>
>
> This is probably something we should give some more thought to up front.
> People will definitely want to be able to at least get decent back traces
> out of their code (functions, file/line/col, arguments maybe) and leaving
> this as an afterthought could cause more efficiency problems down the road.
See my response to Duncan on a similar question. It is covered later
on just below this section and is the first thing that should be wired
up after the rest of the ThinLTO handling.
>
>>
>>
>> a. Clang/LLVM/gold linker options:
>>
>> An early set of clang/llvm patches is needed to provide options to
>> enable ThinLTO (off by default), so that the rest of the
>> implementation can be disabled by default as it is added.
>> Specifically, clang options -fthinlto (used instead of -flto) will
>> cause clang to invoke the phase-1 emission of LLVM bitcode and
>> function summary/index on a compile step, and pass the appropriate
>> option to the gold plugin on a link step. The -thinlto option will be
>> added to the gold plugin and llvm-lto tool to launch the phase-2 thin
>> archive step. The -thinlto option will also be added to the ‘opt’ tool
>> to invoke it as a phase-3 parallel backend instance.
>>
>>
>> b. Thin-archive linking support in Gold plugin and llvm-lto:
>>
>> Under the new plugin option (see above), the plugin needs to perform
>> the phase-2 (thin archive) link which simply emits a combined function
>> map from the linked modules, without actually performing the normal
>> link. Corresponding support should be added to the standalone llvm-lto
>> tool to enable testing/debugging without involving the linker and
>> plugin.
>>
>
> Have you described thin archives anywhere? I might have missed it, but
I'm
> curious how you see this working.
Do you mean the format of the file? David and I discussed some ideas
for representing as a native object file with symtab, which I will
include when I update the RFC early next week. This format could
presumably be used even in the case of bitcode as the intermediate
representation for the TUs. It will be consumed by the backend ThinLTO
import pass.
>
>>
>>
>> c. ThinLTO backend support:
>>
>> Support for invoking a phase-3 backend invocation (including
>> importing) on a module should be added to the ‘opt’ tool under the new
>> option. The main change under the option is to instantiate a Linker
>> object used to manage the process of linking imported functions into
>> the module, efficient read of the combined function map, and enable
>> the ThinLTO import pass.
>
>
> In general the phases that you have here sound interesting, but I'm not
sure
> that I've seen the background describing them? Can you describe this
sort of
> change here in more detail?
>
>>
>> Each function available for importing from the module contains an
>> entry in the module’s function index/summary section and in the
>> resulting combined function map. Each function entry contains that
>> function’s offset within the bitcode file, used to efficiently locate
>> and quickly import just that function. The entry also contains summary
>> information (e.g. basic information determined during parsing such as
>> the number of instructions in the function), that will be used to help
>> guide later import decisions. Because the contents of this section
>> will change frequently during ThinLTO tuning, it should also be marked
>> with a version id for backwards compatibility or version checking.
>>
>
> <Insert bike shed discussion of formatting, versioning, etc>
>
>>
>>
>> e. ThinLTO importing support:
>>
>> Support for the mechanics of importing functions from other modules,
>> which can go in gradually as a set of patches since it will be off by
>> default. Separate patches can include:
>>
>> - BitcodeReader changes to use function index to import/deserialize
>> single function of interest (small changes, leverages existing lazy
>> streamer support).
>>
>
> Sounds like this is trying to optimize the O(n) (effectively) module scan
> with an AoT computation of offset in a file. Perhaps it might be worth
> adding such a functionality into the module itself anyhow?
Do you think it would useful in other cases? For each function we also
need summary data to help guide importing decisions. I was assuming
the index and summary data would be stored together somewhere
(separate section in native object wrapper format, TBD in bitcode
format).
>
>>
>> - Marking of imported functions (for use in ThinLTO-specific symbol
>> linking and global DCE, for example). This can be in-memory initially,
>> but IR support may be required in order to support streaming bitcode
>> out and back in again after importing.
>>
>
> How is this different from the existing linkage facilities?
There is some discussion between Duncan, David Blaikie and myself on this.
>
>>
>> - ModuleLinker changes to do ThinLTO-specific symbol linking and
>> static promotion when necessary. The linkage type of imported
>> functions changes to AvailableExternallyLinkage, for example. Statics
>> must be promoted in certain cases, and renamed in consistent ways.
>>
>
> Ditto.
This is different because currently during LTO linking you don't need
to change the linkage to available externally. And no static promotion
has to be done.
>
>>
>> - GlobalDCE changes to support removing imported functions that were
>> not inlined (very small changes to existing pass logic).
>>
>
> Ditto.
>
> (I think I've seen some discussion here already, if I should go and
read
> those threads just feel free to say that :)
Yes, this is the same thread I mentioned above with Duncan and David Blaikie.
>
>>
>>
>> f. ThinLTO Import Driver SCC pass:
>>
>> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
>> an SCC pass, enabled only under -fthinlto options. The pass includes
>> utilizing the thin archive (global function index/summary), import
>> decision heuristics, invocation of LTOModule/ModuleLinker routines
>> that perform the import, and any necessary callgraph updates and
>> verification.
>>
>
> Would it be worth instead of trying to hook some of this in to clang/opt
but
> have a separate driver to prototype this up? This way the functionality and
> the driver could be separate from the rest of the optimization pipeline as
> well as making it (I'd hope) be more testable.
I'm not sure this helps much if I am understanding the suggestion
correctly. The pass is inserted in the case of a thin lto backend
compile, which we would do under an option. And many of the other
changes are sprinkled around other passes/infrastructure (e.g. bitcode
reader, module linker) which are shared across other tools.
>
> We could also use that as a way to test the decision making etc ala some of
> the -### stuff out of clang or -debug output. (This description is a bit of
> a stretch, but hopefully my point gets across).
>
>>
>> 3. Stage 3: ThinLTO Tuning and Enhancements
>> ----------------------------------------------------------------
>>
>> This refers to the patches that are not required for ThinLTO to work,
>> but rather to improve compile time, memory, run-time performance and
>> usability.
>>
>>
>> a. Lazy Debug Metadata Linking:
>>
>> The prototype implementation included lazy importing of module-level
>> metadata during the ThinLTO pass finalization (i.e. after all function
>> importing is complete). This actually applies to all module-level
>> metadata, not just debug, although it is the largest. This can be
>> added as a separate set of patches. Changes to BitcodeReader,
>> ValueMapper, ModuleLinker
>
>
> Can you describe more of what you've done here? We're trying to
optimize a
> lot of these areas for normal LTO as well.
See some earlier detail I had sent in response to Duncan's question,
and additional discussion. I don't think this helps normal LTO
unfortunately.
>
>>
>> b. Import Tuning:
>>
>> Tuning the import strategy will be an iterative process that will
>> continue to be refined over time. It involves several different types
>> of changes: adding support for recording additional metrics in the
>> function summary, such as profile data and optional heavier-weight IPA
>> analyses, and tuning the import heuristics based on the summary and
>> callsite context.
>>
>
> How is this different from the existing profile work that Diego has been
> doing? I.e. how are the formats etc going to communicate?
As David and Diego mentioned, ThinLTO is just another consumer of the
profile data.
>
>>
>>
>> c. Combined Function Map Pruning:
>>
>> The combined function map can be pruned of functions that are unlikely
>> to benefit from being imported. For example, during the phase-2 thin
>> archive plug step we can safely omit large and (with profile data)
>> cold functions, which are unlikely to benefit from being inlined.
>> Additionally, all but one copy of comdat functions can be suppressed.
>>
>
> The comdat function bit will happen with module linking, but perhaps an
idea
> would be to make a first pass over the code and:
>
> a) create a new module
> b) move cold functions inside while leaving declarations behind
> c) migrate comdat functions the same sort of way (though perhaps not out of
> line)
Sorry, I didn't follow what you were suggesting here. The pruning
above is just applied to the combined function map, the modules aren't
touched. A function not in the map (no associated index/summary)
simply can't be imported.
>
> One random thought is that you'll need to work on the internalize pass
to
> handle the distributed information you have.
>
>>
>>
>> d. Distributed Build System Integration:
>>
>> For a distributed build system, the gold plugin should write the
>> parallel backend invocations into a makefile, including the mapping
>> from the IR file to the real object file path, and exit. Additional
>> work needs to be done in the distributed build system itself to
>> distribute and dispatch the parallel backend jobs to the build
>> cluster.
>>
>
> Hmm? I'd love to see you elaborate here, but it's probably just far
enough
> in the future that we can hit that when we get there.
>
>>
>>
>> e. Dependence Tracking and Incremental Compiles:
>>
>> In order to support build systems that stage from local disks or
>> network storage, the plugin will optionally support computation of
>> dependent sets of IR files that each module may import from. This can
>> be computed from profile data, if it exists, or from the symbol table
>> and heuristics if not. These dependence sets also enable support for
>> incremental backend compiles.
>>
>>
>
> Ditto.
>
> -eric
>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson at google.com |
408-460-2413
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev


-- 
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413

Nick Lewycky

2015-May-19 23:09 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

On 13 May 2015 at 11:44, Teresa Johnson <tejohnson at google.com> wrote:
> I've included below an RFC for implementing ThinLTO in LLVM, looking
> forward to feedback and questions.
>
Thanks! I have to admit up front that I haven't read through the whole
thread, but I have a couple comments. Overall this looks like a really nice
design and unusually thorough RFC!

Thanks!> Teresa
>
>
>
> RFC to discuss plans for implementing ThinLTO upstream. Background can
> be found in slides from EuroLLVM 2015:
>
>
https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0)
> As described in the talk, we have a prototype implementation, and
> would like to start staging patches upstream. This RFC describes a
> breakdown of the major pieces. We would like to commit upstream
> gradually in several stages, with all functionality off by default.
> The core ThinLTO importing support and tuning will require frequent
> change and iteration during testing and tuning, and for that part we
> would like to commit rapidly (off by default). See the proposed staged
> implementation described in the Implementation Plan section.
>
>
> ThinLTO Overview
> =============>
> See the talk slides linked above for more details. The following is a
> high-level overview of the motivation.
>
> Cross Module Optimization (CMO) is an effective means for improving
> runtime performance, by extending the scope of optimizations across
> source module boundaries. Without CMO, the compiler is limited to
> optimizing within the scope of single source modules. Two solutions
> for enabling CMO are Link-Time Optimization (LTO), which is currently
> supported in LLVM and GCC, and Lightweight-Interprocedural
> Optimization (LIPO). However, each of these solutions has limitations
> that prevent it from being enabled by default. ThinLTO is a new
> approach that attempts to address these limitations, with a goal of
> being enabled more broadly. ThinLTO is designed with many of the same
> principals as LIPO, and therefore its advantages, without any of its
> inherent weakness. Unlike in LIPO where the module group decision is
> made at profile training runtime, ThinLTO makes the decision at
> compile time, but in a lazy mode that facilitates large scale
> parallelism. The serial linker plugin phase is designed to be razor
> thin and blazingly fast. By default this step only does minimal
> preparation work to enable the parallel lazy importing performed
> later. ThinLTO aims to be scalable like a regular O2 build, enabling
> CMO on machines without large memory configurations, while also
> integrating well with distributed build systems. Results from early
> prototyping on SPEC cpu2006 C++ benchmarks are in line with
> expectations that ThinLTO can scale like O2 while enabling much of the
> CMO performed during a full LTO build.
>
This is different from llvm's current LTO approach ("big bang
LTO", where
we combine all TUs into a single big Module and the optimize and codegen
it). It sounds like there's two goals here, multi-machine parallelism and
reducing memory usage (by splitting the Module out to multiple machines)
and most of the interesting logic goes into deciding where to split a
Module.

I think ThinLTO was designed under the assumption that we would not be able
to fit a large program into memory on a single machine (or that even if we
could, we wouldn't be able to compile quickly enough by employing
multi-core parallelism). This is in contrast to previously considered
approaches of improving big bang LTO to handle very large programs through
changes to the IR, in-memory representation, on-disk representation and
threading. Starting with the assumption that we will need multiple
machines, ThinLTO looks like an excellent design. I just wanted to call out
that design requirement and how it's different from how llvm has thought
about LTO in the past.

A ThinLTO build is divided into 3 phases, which are referred to in
the> following implementation plan:
>
> phase-1: IR and Function Summary Generation (-c compile)
> phase-2: Thin Linker Plugin Layer (thin archive linker step)
> phase-3: Parallel Backend with Demand-Driven Importing
>
>
> Implementation Plan
> ===============>
> This section gives a high-level breakdown of the ThinLTO support that
> will be added, in roughly the order that the patches would be staged.
> The patches are divided into three stages. The first stage contains a
> minimal amount of preparation work that is not ThinLTO-specific. The
> second stage contains most of the infrastructure for ThinLTO, which
> will be off by default. The third stage includes
> enhancements/improvements/tunings that can be performed after the main
> ThinLTO infrastructure is in.
>
> The second and third implementation stages will initially be very
> volatile, requiring a lot of iterations and tuning with large apps to
> get stabilized. Therefore it will be important to do fast commits for
> these implementation stages.
>
>
> 1. Stage 1: Preparation
> -------------------------------
>
> The first planned sets of patches are enablers for ThinLTO work:
>
>
> a. LTO directory structure:
>
> Restructure the LTO directory to remove circular dependence when
> ThinLTO pass added. Because ThinLTO is being implemented as a SCC pass
> within Transforms/IPO, and leverages the LTOModule class for linking
> in functions from modules, IPO then requires the LTO library. This
> creates a circular dependence between LTO and IPO. To break that, we
> need to split the lib/LTO directory/library into lib/LTO/CodeGen and
> lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
> respectively. Only LTOCodeGenerator has a dependence on IPO, removing
> the circular dependence.
>
>
> b. ELF wrapper generation support:
>
> Implement ELF wrapped bitcode writer. In order to more easily interact
> with tools such as $AR, $NM, and “$LD -r” we plan to emit the phase-1
> bitcode wrapped in ELF via the .llvmbc section, along with a symbol
> table. The goal is both to interact with these tools without requiring
> a plugin, and also to avoid doing partial LTO/ThinLTO across files
> linked with “$LD -r” (i.e. the resulting object file should still
> contain ELF-wrapped bitcode to enable ThinLTO at the full link step).
> I will send a separate design document for these changes, but the
> following is a high-level overview.
>
> Support was added to LLVM for reading ELF-wrapped bitcode
> (http://reviews.llvm.org/rL218078), but there does not yet exist
> support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
> add support for optionally generating bitcode in an ELF file
> containing a single .llvmbc section holding the bitcode. Specifically,
> the patch would add new options “emit-llvm-bc-elf” (object file) and
> corresponding “emit-llvm-elf” (textual assembly code equivalent).
> Eventually these would be automatically triggered under “-fthinlto -c”
> and “-fthinlto -S”, respectively.
>
> Additionally, a symbol table will be generated in the ELF file,
> holding the function symbols within the bitcode. This facilitates
> handling archives of the ELF-wrapped bitcode created with $AR, since
> the archive will have a symbol table as well. The archive symbol table
> enables gold to extract and pass to the plugin the constituent
> ELF-wrapped bitcode files. To support the concatenated llvmbc section
> generated by “$LD -r”, some handling needs to be added to gold and to
> the backend driver to process each original module’s bitcode.
>
> The function index/summary will later be added as a special ELF
> section alongside the .llvmbc sections.
>
We've historically pushed back on adding ELF because it doesn't add any
new
information that isn't present in the .bc file, and we care a lot about
minimizing I/O time (I recall an encoding change in the bitcode format
shrinking .bc files 10% which led to a big improvement in LTO times for
Darwin).

There's a few practical matters about what needs to be in this ELF symbol
table; what about symbols that we reference, instead of just those we
define? what about the sizes of symbols we define? what about the case
where llvm codegen ends up defining (or referencing) a function that isn't
mentioned in the IR (a common example is emitting a call to memcpy for
argument lowering)? If you have a set of tools in mind, we can make the ELF
accurate enough to work with those tools, but it's not clear to me how to
make it work for fully general ELF-expecting programs without doing full
codegen into the file (IIRC, this is what GCC does). Are 'ar',
'nm' and
'ld' the only programs?

Finally, suppose you get into a situation where you implement ThinLTO with
the elf wrappers and then examine the compile time, memory usage, file size
and I/O, and find that ThinLTO isn't performing as well as we like. The
next question is going to be "well, what if we removed that extra I/O time,
file size (copying time) and memory usage from having that ELF wrapper"?
That's why I think of a .bc-only version as being the ideal version, and
that having ELF wrapping is a good idea for supporting legacy programs as
needed.

2. Stage 2: ThinLTO Infrastructure> ----------------------------------------------
>
> The next set of patches adds the base implementation of the ThinLTO
> infrastructure, specifically those required to make ThinLTO functional
> and generate correct but not necessarily high-performing binaries. It
> also does not include support to make debug support under -g efficient
> with ThinLTO.
>
>
> a. Clang/LLVM/gold linker options:
>
> An early set of clang/llvm patches is needed to provide options to
> enable ThinLTO (off by default), so that the rest of the
> implementation can be disabled by default as it is added.
> Specifically, clang options -fthinlto (used instead of -flto) will
> cause clang to invoke the phase-1 emission of LLVM bitcode and
> function summary/index on a compile step, and pass the appropriate
> option to the gold plugin on a link step. The -thinlto option will be
> added to the gold plugin and llvm-lto tool to launch the phase-2 thin
> archive step. The -thinlto option will also be added to the ‘opt’ tool
> to invoke it as a phase-3 parallel backend instance.
>
>
> b. Thin-archive linking support in Gold plugin and llvm-lto:
>
> Under the new plugin option (see above), the plugin needs to perform
> the phase-2 (thin archive) link which simply emits a combined function
> map from the linked modules, without actually performing the normal
> link. Corresponding support should be added to the standalone llvm-lto
> tool to enable testing/debugging without involving the linker and
> plugin.
>
>
> c. ThinLTO backend support:
>
> Support for invoking a phase-3 backend invocation (including
> importing) on a module should be added to the ‘opt’ tool under the new
> option. The main change under the option is to instantiate a Linker
> object used to manage the process of linking imported functions into
> the module, efficient read of the combined function map, and enable
> the ThinLTO import pass.
>
>
> d. Function index/summary support:
>
> This includes infrastructure for writing and reading the function
> index/summary section. As noted earlier this will be encoded in a
> special ELF section within the module, alongside the .llvmbc section
> containing the bitcode. The thin archive generated by phase-2 of
> ThinLTO simply contains all of the function index/summary sections
> across the linked modules, organized for efficient function lookup.
>
> Each function available for importing from the module contains an
> entry in the module’s function index/summary section and in the
> resulting combined function map. Each function entry contains that
> function’s offset within the bitcode file, used to efficiently locate
> and quickly import just that function. The entry also contains summary
> information (e.g. basic information determined during parsing such as
> the number of instructions in the function), that will be used to help
> guide later import decisions. Because the contents of this section
> will change frequently during ThinLTO tuning, it should also be marked
> with a version id for backwards compatibility or version checking.
>
I have an idea for a future version.

Give passes the ability to write their own summary data at compile time,
and to read them in the backends. Merge these summaries in the link, then
after splitting send the merged summaries to each backend regardless of
whether it imports the function body. For instance, dead argument
elimination could summarize which functions ignore which arguments (either
entirely, or locally except for which arguments in which callees).
Receiving a full graph of this is smaller than the full implementations of
the functions, and yet would allow each backend to do an analysis of the
full graph. Function A's body is in this backend, and A calls B whose body
is not available to this backend. The summary would include that the first
argument to B is dead, so we can optimize away the chain of computation
leading to it in A. (I think a more compelling example will be alias
analysis, but it would make for a messier example.)

Nick

e. ThinLTO importing support:>
> Support for the mechanics of importing functions from other modules,
> which can go in gradually as a set of patches since it will be off by
> default. Separate patches can include:
>
> - BitcodeReader changes to use function index to import/deserialize
> single function of interest (small changes, leverages existing lazy
> streamer support).
>
> - Minor LTOModule changes to pass the ThinLTO function to import and
> its index into bitcode reader.
>
> - Marking of imported functions (for use in ThinLTO-specific symbol
> linking and global DCE, for example). This can be in-memory initially,
> but IR support may be required in order to support streaming bitcode
> out and back in again after importing.
>
> - ModuleLinker changes to do ThinLTO-specific symbol linking and
> static promotion when necessary. The linkage type of imported
> functions changes to AvailableExternallyLinkage, for example. Statics
> must be promoted in certain cases, and renamed in consistent ways.
>
> - GlobalDCE changes to support removing imported functions that were
> not inlined (very small changes to existing pass logic).
>
>
> f. ThinLTO Import Driver SCC pass:
>
> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
> an SCC pass, enabled only under -fthinlto options. The pass includes
> utilizing the thin archive (global function index/summary), import
> decision heuristics, invocation of LTOModule/ModuleLinker routines
> that perform the import, and any necessary callgraph updates and
> verification.
>
>
> g. Backend Driver:
>
> For a single node build, the gold plugin can simply write a makefile
> and fork the parallel backend instances directly via parallel make.
>
>
> 3. Stage 3: ThinLTO Tuning and Enhancements
> ----------------------------------------------------------------
>
> This refers to the patches that are not required for ThinLTO to work,
> but rather to improve compile time, memory, run-time performance and
> usability.
>
>
> a. Lazy Debug Metadata Linking:
>
> The prototype implementation included lazy importing of module-level
> metadata during the ThinLTO pass finalization (i.e. after all function
> importing is complete). This actually applies to all module-level
> metadata, not just debug, although it is the largest. This can be
> added as a separate set of patches. Changes to BitcodeReader,
> ValueMapper, ModuleLinker
>
>
> b. Import Tuning:
>
> Tuning the import strategy will be an iterative process that will
> continue to be refined over time. It involves several different types
> of changes: adding support for recording additional metrics in the
> function summary, such as profile data and optional heavier-weight IPA
> analyses, and tuning the import heuristics based on the summary and
> callsite context.
>
>
> c. Combined Function Map Pruning:
>
> The combined function map can be pruned of functions that are unlikely
> to benefit from being imported. For example, during the phase-2 thin
> archive plug step we can safely omit large and (with profile data)
> cold functions, which are unlikely to benefit from being inlined.
> Additionally, all but one copy of comdat functions can be suppressed.
>
>
> d. Distributed Build System Integration:
>
> For a distributed build system, the gold plugin should write the
> parallel backend invocations into a makefile, including the mapping
> from the IR file to the real object file path, and exit. Additional
> work needs to be done in the distributed build system itself to
> distribute and dispatch the parallel backend jobs to the build
> cluster.
>
>
> e. Dependence Tracking and Incremental Compiles:
>
> In order to support build systems that stage from local disks or
> network storage, the plugin will optionally support computation of
> dependent sets of IR files that each module may import from. This can
> be computed from profile data, if it exists, or from the symbol table
> and heuristics if not. These dependence sets also enable support for
> incremental backend compiles.
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150519/4a60a5b9/attachment.html>

Xinliang David Li

2015-May-20 00:20 UTC

head link

[LLVMdev] RFC: ThinLTO Impementation Plan

On Tue, May 19, 2015 at 4:09 PM, Nick Lewycky <nlewycky at google.com>
wrote:
> On 13 May 2015 at 11:44, Teresa Johnson <tejohnson at google.com>
wrote:
>
>> I've included below an RFC for implementing ThinLTO in LLVM,
looking
>> forward to feedback and questions.
>>
>
> Thanks! I have to admit up front that I haven't read through the whole
> thread, but I have a couple comments. Overall this looks like a really nice
> design and unusually thorough RFC!
>
>
>>
> This is different from llvm's current LTO approach ("big bang
LTO", where
> we combine all TUs into a single big Module and the optimize and codegen
> it). It sounds like there's two goals here, multi-machine parallelism
and
> reducing memory usage (by splitting the Module out to multiple machines)
> and most of the interesting logic goes into deciding where to split a
> Module.
>
> I think ThinLTO was designed under the assumption that we would not be
> able to fit a large program into memory on a single machine (or that even
> if we could, we wouldn't be able to compile quickly enough by employing
> multi-core parallelism). This is in contrast to previously considered
> approaches of improving big bang LTO to handle very large programs through
> changes to the IR, in-memory representation, on-disk representation and
> threading. Starting with the assumption that we will need multiple
> machines, ThinLTO looks like an excellent design. I just wanted to call out
> that design requirement and how it's different from how llvm has
thought
> about LTO in the past.
>
ThinLTO is designed to be Corolla, while LTO will continue to be the
Mercedes :)


>
>> The function index/summary will later be added as a special ELF
>> section alongside the .llvmbc sections.
>>
>
> We've historically pushed back on adding ELF because it doesn't add
any
> new information that isn't present in the .bc file, and we care a lot
about
> minimizing I/O time (I recall an encoding change in the bitcode format
> shrinking .bc files 10% which led to a big improvement in LTO times for
> Darwin).
>
For LTO, which is already highly stressed, small increase in I/O does
matter a lot.

> There's a few practical matters about what needs to be in this ELF
symbol
> table; what about symbols that we reference, instead of just those we
> define?
>
UNDEF symbols or match what plugin does.

> what about the sizes of symbols we define?
>
In elf wrapper, the function is 'defined' in the summary section. Its
offset and size is the summary entry's offset and size.

what about the case where llvm codegen ends up defining (or referencing)
a> function that isn't mentioned in the IR (a common example is emitting a
> call to memcpy for argument lowering)?
>
We don't expect symtab generated for IR matches that with the final object,
for instance dead function elimination can happen etc.

> If you have a set of tools in mind, we can make the ELF accurate enough to
> work with those tools, but it's not clear to me how to make it work for
> fully general ELF-expecting programs without doing full codegen into the
> file (IIRC, this is what GCC does). Are 'ar', 'nm' and
'ld' the only
> programs?
>
ranlib, objcopy.

GCC always wraps IR into ELF wrapper even when it does not generate fat
object. However GCC's IR only ELF file have a customized symtab section.

ICC generates a ELF for IR only case -- with a full ELF symtab generated.
It supports fat object file too.

HP's aCC generates ELF wrapper for intermediate file with full ELF symtab
too.

>
> Finally, suppose you get into a situation where you implement ThinLTO with
> the elf wrappers and then examine the compile time, memory usage, file size
> and I/O, and find that ThinLTO isn't performing as well as we like. The
> next question is going to be "well, what if we removed that extra I/O
time,
> file size (copying time) and memory usage from having that ELF
wrapper"?
> That's why I think of a .bc-only version as being the ideal version,
and
> that having ELF wrapping is a good idea for supporting legacy programs as
> needed.
>
I like the best of both worlds.
>
>
>> I have an idea for a future version.
>
> Give passes the ability to write their own summary data at compile time,
> and to read them in the backends. Merge these summaries in the link, then
> after splitting send the merged summaries to each backend regardless of
> whether it imports the function body. For instance, dead argument
> elimination could summarize which functions ignore which arguments (either
> entirely, or locally except for which arguments in which callees).
> Receiving a full graph of this is smaller than the full implementations of
> the functions, and yet would allow each backend to do an analysis of the
> full graph. Function A's body is in this backend, and A calls B whose
body
> is not available to this backend. The summary would include that the first
> argument to B is dead, so we can optimize away the chain of computation
> leading to it in A. (I think a more compelling example will be alias
> analysis, but it would make for a messier example.)
>
>yes -- that is what we had thought about doing for LIPO -- callgraphs,
whole program aliases are good candidates.  So what you describe is we'd
like to do for thinLTO. Those global analyses are more expensive than the
fast indexing, but can be controlled with knobs.

thanks,

David


> Nick
>
> e. ThinLTO importing support:
>>
>> Support for the mechanics of importing functions from other modules,
>> which can go in gradually as a set of patches since it will be off by
>> default. Separate patches can include:
>>
>> - BitcodeReader changes to use function index to import/deserialize
>> single function of interest (small changes, leverages existing lazy
>> streamer support).
>>
>> - Minor LTOModule changes to pass the ThinLTO function to import and
>> its index into bitcode reader.
>>
>> - Marking of imported functions (for use in ThinLTO-specific symbol
>> linking and global DCE, for example). This can be in-memory initially,
>> but IR support may be required in order to support streaming bitcode
>> out and back in again after importing.
>>
>> - ModuleLinker changes to do ThinLTO-specific symbol linking and
>> static promotion when necessary. The linkage type of imported
>> functions changes to AvailableExternallyLinkage, for example. Statics
>> must be promoted in certain cases, and renamed in consistent ways.
>>
>> - GlobalDCE changes to support removing imported functions that were
>> not inlined (very small changes to existing pass logic).
>>
>>
>> f. ThinLTO Import Driver SCC pass:
>>
>> Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
>> an SCC pass, enabled only under -fthinlto options. The pass includes
>> utilizing the thin archive (global function index/summary), import
>> decision heuristics, invocation of LTOModule/ModuleLinker routines
>> that perform the import, and any necessary callgraph updates and
>> verification.
>>
>>
>> g. Backend Driver:
>>
>> For a single node build, the gold plugin can simply write a makefile
>> and fork the parallel backend instances directly via parallel make.
>>
>>
>> 3. Stage 3: ThinLTO Tuning and Enhancements
>> ----------------------------------------------------------------
>>
>> This refers to the patches that are not required for ThinLTO to work,
>> but rather to improve compile time, memory, run-time performance and
>> usability.
>>
>>
>> a. Lazy Debug Metadata Linking:
>>
>> The prototype implementation included lazy importing of module-level
>> metadata during the ThinLTO pass finalization (i.e. after all function
>> importing is complete). This actually applies to all module-level
>> metadata, not just debug, although it is the largest. This can be
>> added as a separate set of patches. Changes to BitcodeReader,
>> ValueMapper, ModuleLinker
>>
>>
>> b. Import Tuning:
>>
>> Tuning the import strategy will be an iterative process that will
>> continue to be refined over time. It involves several different types
>> of changes: adding support for recording additional metrics in the
>> function summary, such as profile data and optional heavier-weight IPA
>> analyses, and tuning the import heuristics based on the summary and
>> callsite context.
>>
>>
>> c. Combined Function Map Pruning:
>>
>> The combined function map can be pruned of functions that are unlikely
>> to benefit from being imported. For example, during the phase-2 thin
>> archive plug step we can safely omit large and (with profile data)
>> cold functions, which are unlikely to benefit from being inlined.
>> Additionally, all but one copy of comdat functions can be suppressed.
>>
>>
>> d. Distributed Build System Integration:
>>
>> For a distributed build system, the gold plugin should write the
>> parallel backend invocations into a makefile, including the mapping
>> from the IR file to the real object file path, and exit. Additional
>> work needs to be done in the distributed build system itself to
>> distribute and dispatch the parallel backend jobs to the build
>> cluster.
>>
>>
>> e. Dependence Tracking and Incremental Compiles:
>>
>> In order to support build systems that stage from local disks or
>> network storage, the plugin will optionally support computation of
>> dependent sets of IR files that each module may import from. This can
>> be computed from profile data, if it exists, or from the symbol table
>> and heuristics if not. These dependence sets also enable support for
>> incremental backend compiles.
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson at google.com |
408-460-2413
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150519/da5dbc9c/attachment.html>

Maybe Matching Threads

Search for more maybe matching threads

llvm dev - May 2015 - [LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

[LLVMdev] RFC: ThinLTO Impementation Plan

Maybe Matching Threads