thr3ads.net - llvm dev - [llvm-dev] [RFC] lld: mostly-concurrent symbol resolution [Apr 2019]

Rui Ueyama via llvm-dev

2019-Apr-22 08:53 UTC

[llvm-dev] [RFC] lld: mostly-concurrent symbol resolution

Hi all,

This is a design change proposal to make lld's symbol resolution phase
mostly-concurrent without changing the existing semantics. The aim of this
change is to speed up the linker on multi-core machines.

*Background:*
Even though many parts of lld are multi-threaded, the symbol resolution
phase is single-threaded. In the symbol resolution phase, the linker does
the following:

 - Read one symbol at a time from each input file and insert it into a hash
table.
 - If we find an undefined symbol that can be resolved by an object file in
a static archive file, immediately pull that file out of the archive
(which may transitively pull out other files from other archives).

The output of this phase is a set of symbols merged by name and a set of
object files to be included in an output file.
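
To make the two steps above concrete, here is a minimal sketch of a
single-threaded resolver. Everything in it (`ObjectFile`, `Archive`, `Entry`,
`Linker`, `addObject`, `addArchive`) is a hypothetical, simplified stand-in
and not lld's actual code; it only illustrates "insert by name, pull archive
members lazily, first definition wins".

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins; these are not lld's real classes.
struct ObjectFile {
  std::string name;
  std::vector<std::string> defined;   // symbols this file defines
  std::vector<std::string> undefined; // symbols this file references
};

struct Archive {
  std::vector<ObjectFile> members;
};

enum class Kind { Undefined, Lazy, Defined };

struct Entry {
  Kind kind = Kind::Undefined;
  const ObjectFile *file = nullptr; // definer, or the archive member for Lazy
};

struct Linker {
  std::unordered_map<std::string, Entry> table; // keyed by symbol string
  std::vector<std::string> linked;              // files in the output, in order

  void addObject(const ObjectFile &f) {
    linked.push_back(f.name);
    for (const std::string &s : f.defined) {
      Entry &e = table[s];
      if (e.kind != Kind::Defined) { // the first definition wins
        e.kind = Kind::Defined;
        e.file = &f;
      }
    }
    for (const std::string &s : f.undefined) {
      Entry &e = table[s]; // creates an Undefined entry if absent
      if (e.kind == Kind::Lazy) {
        assert(e.file && "lazy entries always record a member");
        addObject(*e.file); // pull the member out of its archive
      }
    }
  }

  void addArchive(const Archive &a) {
    for (const ObjectFile &m : a.members)
      for (const std::string &s : m.defined) {
        auto it = table.find(s);
        if (it == table.end()) {
          Entry &e = table[s];
          e.kind = Kind::Lazy;
          e.file = &m;
        } else if (it->second.kind == Kind::Undefined) {
          addObject(m); // an earlier file already wants this symbol
        }
      }
  }
};
```

With inputs in the order main.o, an archive defining `foo`, then a second
archive that also defines `foo`, only the member from the first archive ends
up in `linked`. Every step goes through a string-keyed hash lookup, and the
result depends on the order in which files are visited.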

We couldn't use threads in this phase because it was hard to guarantee the
determinism of the link result. To see why, assume that more than one
archive file defines the same symbol (which is not an odd assumption, as it
is common practice to write your own libc functions and override the
standard ones by inserting your file before `-lc` on the command line, for
example). If two or more threads read input files simultaneously, it is
hard to guarantee the link order. To keep things simple, we chose not to
use multi-threading at all in this phase.

This phase takes roughly 1/3 of the total execution time for typical
programs.

*Analysis:*

Doing something for every symbol is not necessarily slow even if the number
of symbols is large. We have many loops in lld that iterate over all symbol
objects (lld's internal representation of symbols), and those loops don't
take much time. What is actually slow when reading input files is
inserting symbol strings into an in-memory hash table. This is particularly
bad when linking large programs. Large programs tend to be written in
C++, and C++ symbol names are particularly long due to name mangling.
Inserting hundreds of thousands of long strings into a hash table is
computationally intensive work.

*Proposal:*

I propose we split the symbol resolution phase into two phases and use
multi-threading in the first phase. In the first phase, we visit *all*
files, including ones in archive files, and insert all symbol strings into
sharded hash tables. Each symbol string is inserted into a symbol table
with a placeholder symbol object as its value.

Each file contains a member `std::vector<Symbol *> Symbols`. Once the first
phase is done, file A's N'th symbol and file B's M'th symbol have the same
name if and only if `A.Symbols[N] == B.Symbols[M]`. That means symbols are
merged by name, but "symbol resolution" in the regular linker's sense is
not done yet. This phase is highly parallelizable.
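
As a rough illustration of this first phase, the sketch below interns symbol
names into sharded hash tables from several threads. `ShardedSymbolTable`,
`intern`, `internAll`, the `names` member and the per-shard mutex are all
assumptions made for the example; the proposal does not specify how the
shards are synchronized or how work is distributed among threads.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// Placeholder symbol object; real resolution state would be filled in later.
struct Symbol {
  std::string name;
};

// Hypothetical input file: `names` are the symbol strings read from disk,
// `Symbols` is the per-file vector described in the text.
struct InputFile {
  std::vector<std::string> names;
  std::vector<Symbol *> Symbols;
};

class ShardedSymbolTable {
public:
  explicit ShardedSymbolTable(std::size_t numShards) : shards(numShards) {
    assert(numShards > 0);
  }

  // Returns the unique placeholder for `name`, creating it on first use.
  Symbol *intern(const std::string &name) {
    Shard &shard = shards[std::hash<std::string>{}(name) % shards.size()];
    std::lock_guard<std::mutex> lock(shard.mu); // only this shard is locked
    auto res = shard.map.emplace(name, Symbol{name});
    return &res.first->second; // unordered_map keeps element addresses stable
  }

private:
  struct Shard {
    std::mutex mu;
    std::unordered_map<std::string, Symbol> map;
  };
  std::vector<Shard> shards;
};

// Each worker takes every numThreads-th file; no file is shared by two workers.
void internAll(std::vector<InputFile> &files, ShardedSymbolTable &table,
               unsigned numThreads) {
  assert(numThreads > 0);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t)
    workers.emplace_back([&files, &table, t, numThreads] {
      for (std::size_t i = t; i < files.size(); i += numThreads) {
        InputFile &f = files[i];
        f.Symbols.clear();
        for (const std::string &n : f.names)
          f.Symbols.push_back(table.intern(n));
      }
    });
  for (std::thread &w : workers)
    w.join();
}
```

After `internAll` returns, two slots in any two files hold the same pointer
exactly when their names are equal, regardless of which thread got there
first. Which thread creates a given placeholder is nondeterministic, but that
doesn't matter because nothing has been resolved yet.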

In the second phase, we visit each input file serially and do name
resolution just like we are currently doing. The only difference is that,
for each symbol, instead of looking up the symbol string in a symbol table,
we just dereference a pointer to obtain the corresponding symbol object,
which should be extremely fast. Since the symbol resolution algorithm is
still single-threaded and doesn't change, the output remains the same -- it
is deterministic, and if two files define the same symbol, the first file
is chosen.
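
A minimal sketch of what such a serial second phase could look like, assuming
every file's `Symbols` slots already point at shared placeholder objects. The
names `SymbolRef`, `link`, `resolve` and the fields `definedBy`, `lazy` and
`wanted` are invented for this example and do not correspond to lld's real
data structures.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct InputFile;

// Hypothetical placeholder, shared by every file that mentions the name.
struct Symbol {
  InputFile *definedBy = nullptr; // first linked file that defines it
  InputFile *lazy = nullptr;      // first archive member able to define it
  bool wanted = false;            // referenced by some linked file
};

struct SymbolRef {
  Symbol *sym;
  bool isDefinition;
};

struct InputFile {
  std::string name;
  bool isArchiveMember = false;
  bool linked = false;
  std::vector<SymbolRef> Symbols;
};

// Adds `f` to the output and resolves its symbols without any string lookup.
void link(InputFile &f, std::vector<InputFile *> &out) {
  if (f.linked)
    return;
  f.linked = true;
  out.push_back(&f);
  for (SymbolRef &r : f.Symbols) {
    assert(r.sym && "the first phase must have filled in every slot");
    Symbol &s = *r.sym; // pointer dereference instead of hashing a name
    if (r.isDefinition) {
      if (!s.definedBy)
        s.definedBy = &f; // the first definition wins
    } else if (!s.definedBy) {
      s.wanted = true;
      if (s.lazy)
        link(*s.lazy, out); // pull the archive member in
    }
  }
}

// Visits files serially, in command-line order.
std::vector<InputFile *> resolve(std::vector<InputFile> &files) {
  std::vector<InputFile *> out;
  for (InputFile &f : files) {
    if (!f.isArchiveMember) {
      link(f, out);
      continue;
    }
    for (SymbolRef &r : f.Symbols) {
      if (!r.isDefinition || r.sym->definedBy)
        continue;
      if (r.sym->wanted) { // an earlier file needs this member
        link(f, out);
        break;
      }
      if (!r.sym->lazy)
        r.sym->lazy = &f; // remember it in case a later file needs it
    }
  }
  return out;
}
```

Because `resolve` walks the files in a fixed order on one thread, the set of
linked files and the chosen definitions do not depend on how the placeholders
were created.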

One caveat is that with this scheme we effectively ignore archive files'
symbol tables. We instead read symbols directly from archive members. This
shouldn't change the semantics because an archive file's symbol table
should be consistent with its member files. But this may be perceived as
weird behavior because lld would work "correctly" even if an archive
file's symbol table is inconsistent, corrupted or not present.

*Implementation details:*

I don't expect too much change, but it looks like I need to move the symbol
resolution code (which calls replaceSymbol to replace symbols) from
SymbolTable.cpp to somewhere else (perhaps to Symbols.cpp), because in the
second phase, we can determine if two symbols have the same name without
asking the symbol table. Even in the current architecture, that code
doesn't really have to belong to the symbol table, so refactoring it seems
like a good idea in general. I'd do this first before implementing this
proposal.

Rui

David Blaikie via llvm-dev

2019-Apr-22 18:53 UTC

[llvm-dev] [RFC] lld: mostly-concurrent symbol resolution

Sounds pretty good to me. An interesting case that occurs to me is
situations where there are many libraries specified, but only a few are
used. In the current scheme, the scalability of the resolution
algorithm isn't dependent on the total number of libraries, is it? (It
stops once all symbols are resolved - potentially not visiting many
unused libraries)

I wonder if there are cases where many unused libraries are specified
(perhaps in a situation where one project builds multiple executables
and some of those executables only use a few external libraries, while
others use many - but all of the external libraries are written once
in a list of external dependencies the project as a whole depends on -
or I guess a situation where there are two executables in a project,
one uses external libraries A, the other uses B, but both just get a
link line that specifies A and B) - which would result in this
strategy doing potentially significantly more work than the current
implementation.

But it'll be great to see the data no doubt - can gauge how much this
costs (how many parallel threads is the break-even point for this
separation of work, how many unused libraries would be needed to
overwhelm the benefits, etc).

- Dave

Rui Ueyama via llvm-dev

2019-Apr-23 01:49 UTC

[llvm-dev] [RFC] lld: mostly-concurrent symbol resolution

On Tue, Apr 23, 2019 at 3:53 AM David Blaikie <dblaikie at gmail.com> wrote:
> Sounds pretty good to me. An interesting case that occurs to me is
> situations where there are many libraries specified, but only a few are
> used. In the current scheme, the scalability of the resolution
> algorithm isn't dependent on the total number of libraries, is it? (It
> stops once all symbols are resolved - potentially not visiting many
> unused libraries)
>
The cost of the current algorithm is actually linear in the number of
libraries, because we insert all symbols in archive files' symbol tables
into the hash table (so that, when we see an undefined symbol, we know
whether there's an archive defining a symbol of the same name).

That being said, the proposed new algorithm would do a bit more work than
the current algorithm. In the new algorithm, we insert all symbols,
including undefined ones, into the symbol table, while an archive file's
symbol table contains only defined symbols. So, in theory, if we have a lot
of object files in archives that end up not being used, and if we run the
linker on a machine with a small number of cores, the new algorithm could
be slower. I'd expect that the break-even point is not that high, so I
think this is not a problem in practice, but that's something I'd like to
confirm by experimenting.

