Rui Ueyama via llvm-dev
2019-Apr-22 08:53 UTC
[llvm-dev] [RFC] lld: mostly-concurrent symbol resolution
Hi all,

This is a design change proposal to make lld's symbol resolution phase mostly-concurrent without changing the existing semantics. The aim of this change is to speed up the linker on multi-core machines.

*Background:*

Even though many parts of lld are multi-threaded, the symbol resolution phase is single-threaded. In the symbol resolution phase, the linker does the following:

- Read one symbol at a time from each input file and insert it into a hash table.
- If we find an undefined symbol that can be resolved by an object file in a static archive file, immediately pull that file out of the archive (which may transitively pull out other files from other archives).

The output of this phase is a set of symbols merged by name and a set of object files to be included in the output file.

We couldn't use threads in this phase because it was hard to guarantee the determinism of the link result. To see why, assume that more than one archive file defines the same symbol (which is not an odd assumption, as it is a common practice to write your own libc functions and override the standard ones by inserting your file before `-lc` in the command line, for example). If two or more threads simultaneously read input files, it is hard to guarantee the link order. To keep things simple, we chose not to use multi-threading at all in this phase.

This phase takes roughly 1/3 of the total execution time for typical programs.

*Analysis:*

Doing something for every symbol is not necessarily slow even if the number of symbols is large. We have many loops in lld that iterate over all symbol objects (lld's internal representation of symbols), and those loops don't take too much time. What is actually slow when reading input files is inserting symbol strings into an in-memory hash table. This is particularly bad when linking large programs. Large programs tend to be written in C++, and C++ symbol names are particularly long due to name mangling.
Inserting hundreds of thousands of long strings into a hash table is computationally intensive work.

*Proposal:*

I propose we split the symbol resolution phase into two phases and use multi-threading in the first phase. In the first phase, we visit *all* files, including ones in archive files, to insert all symbol strings into sharded hash tables. For each symbol string, we insert it into a symbol table with a placeholder symbol object as a value.

Each file contains a member `std::vector<Symbol *> Symbols`. Once the first phase is done, file A's N'th symbol and file B's M'th symbol have the same name if and only if `A.Symbols[N] == B.Symbols[M]`. That means symbols are merged by name, but "symbol resolution" in the regular linker's sense is not done yet. This phase is highly parallelizable.

In the second phase, we visit each input file serially and do name resolution just like we are currently doing. The only difference is that, for each symbol, instead of looking up a symbol table with a symbol string, we just dereference a pointer to obtain the corresponding symbol object, which should be extremely fast. Since the symbol resolution algorithm is still single-threaded and doesn't change, the output remains the same -- it is deterministic, and if two files define the same symbol, the first file is chosen.

One caveat is that with this scheme we effectively ignore archive files' symbol tables. We instead read symbols directly from archive members. This shouldn't change the semantics because an archive file's symbol table should be consistent with its member files. But this may be perceived as weird behavior because lld would work "correctly" even if an archive file's symbol table is inconsistent, corrupted or not present.
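The two phases described above can be sketched roughly as follows. This is a minimal illustration, not lld's code: `RawSymbol`, `ShardedSymbolTable`, `intern`, `mergeByName`, `resolveSerially`, the `definedBy` field and the shard count of 16 are all hypothetical names and choices made for the example, and lazy extraction of archive members is left out entirely. Only the `std::vector<Symbol *> Symbols` member comes from the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct InputFile;

// Placeholder symbol object: phase 1 only creates it, phase 2 fills it in.
struct Symbol {
  std::string name;
  const InputFile *definedBy = nullptr; // hypothetical field, set in phase 2
};

// Hypothetical, simplified view of one symbol read from an input file.
struct RawSymbol {
  std::string name;
  bool isDefined;
};

struct InputFile {
  std::string path;
  std::vector<RawSymbol> raw;    // symbols in the order they appear in the file
  std::vector<Symbol *> Symbols; // Symbols[i] is the merged object for raw[i]
};

// Hash table split into independently locked shards, so threads inserting
// names that hash to different shards do not contend on the same lock.
class ShardedSymbolTable {
public:
  // Returns the unique Symbol for `name`, creating a placeholder if needed.
  Symbol *intern(const std::string &name) {
    Shard &shard = shards[std::hash<std::string>{}(name) % NumShards];
    std::lock_guard<std::mutex> lock(shard.mu);
    std::unique_ptr<Symbol> &slot = shard.map[name];
    if (!slot) {
      slot = std::make_unique<Symbol>();
      slot->name = name;
    }
    return slot.get(); // heap-allocated, so the pointer survives rehashing
  }

private:
  static constexpr std::size_t NumShards = 16; // arbitrary for the sketch
  struct Shard {
    std::mutex mu;
    std::unordered_map<std::string, std::unique_ptr<Symbol>> map;
  };
  Shard shards[NumShards];
};

// Phase 1 (parallel): merge symbols by name. One thread per file keeps the
// sketch short; a real implementation would more likely use a thread pool.
void mergeByName(std::vector<InputFile> &files, ShardedSymbolTable &table) {
  std::vector<std::thread> workers;
  for (InputFile &file : files)
    workers.emplace_back([&file, &table] {
      file.Symbols.reserve(file.raw.size());
      for (const RawSymbol &r : file.raw)
        file.Symbols.push_back(table.intern(r.name));
    });
  for (std::thread &t : workers)
    t.join();
}

// Phase 2 (serial): walk files in command-line order with no string lookups.
// The first file that defines a symbol wins, so the result is deterministic.
void resolveSerially(std::vector<InputFile> &files) {
  for (InputFile &file : files) {
    assert(file.Symbols.size() == file.raw.size());
    for (std::size_t i = 0; i < file.raw.size(); ++i) {
      Symbol *sym = file.Symbols[i];
      if (file.raw[i].isDefined && !sym->definedBy)
        sym->definedBy = &file;
    }
  }
}
```

Calling `mergeByName` and then `resolveSerially` on a vector of files in command-line order gives the property the proposal relies on: after the first call, two entries of `Symbols` in different files compare equal exactly when the names match, and the second call never touches a string.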
*Implementation details:*

I don't expect too much change, but it looks like I need to move the symbol resolution code (which calls replaceSymbol to replace symbols) from SymbolTable.cpp to somewhere else (perhaps to Symbols.cpp), because in the second phase we can determine whether two symbols have the same name without asking the symbol table. Even in the current architecture, that code doesn't really have to belong to the symbol table, so refactoring it seems like a good idea in general. I'd do this first before implementing this proposal.

Rui
David Blaikie via llvm-dev
2019-Apr-22 18:53 UTC
[llvm-dev] [RFC] lld: mostly-concurrent symbol resolution
Sounds pretty good to me.

An interesting case that occurs to me is situations where there are many libraries specified, but only a few are used. In the current scheme, the scalability of the resolution algorithm isn't dependent on the total number of libraries, is it? (It stops once all symbols are resolved - potentially not visiting many unused libraries)

I wonder if there are cases where many unused libraries are specified (perhaps in a situation where one project builds multiple executables and some of those executables only use a few external libraries, while others use many - but all of the external libraries are written once in a list of external dependencies the project as a whole depends on - or I guess a situation where there are two executables in a project, one uses external libraries A, the other uses B, but both just get a link line that specifies A and B) - which would result in this strategy doing potentially significantly more work than the current implementation.

But it'll be great to see the data no doubt - can gauge how much this costs (how many parallel threads is the break-even point for this separation of work, how many unused libraries would be needed to overwhelm the benefits, etc).

- Dave
Rui Ueyama via llvm-dev
2019-Apr-23 01:49 UTC
[llvm-dev] [RFC] lld: mostly-concurrent symbol resolution
On Tue, Apr 23, 2019 at 3:53 AM David Blaikie <dblaikie at gmail.com> wrote:

> Sounds pretty good to me. An interesting case that occurs to me is
> situations where there are many libraries specified, but only few are
> used. In the current scheme, the scalability of the resolution
> algorithm isn't dependent on the total number of libraries, is it? (It
> stops once all symbols are resolved - potentially not visiting many
> unused libraries)

The cost of the current algorithm is actually linear in the number of libraries, because we insert all symbols in archive files' symbol tables into the hash table (so that we are able to know if there's an archive defining some symbol when we see an undefined symbol of the same name).

That being said, the proposed new algorithm would do a bit more work than the current algorithm. In the new algorithm, we insert all symbols, including undefined ones, into the symbol table, while an archive file's symbol table contains only defined symbols. So, in theory, if we have a lot of object files in archives that end up not being used, and if we run the linker on a machine with a small number of cores, the new algorithm could be slower. I'd expect the break-even point to be not that high, so I think this is not a problem in practice, but that's something I'd like to find out by experimenting.
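One rough way to frame the break-even question raised in this exchange, as a back-of-the-envelope model rather than a measurement: every quantity below is an assumption introduced for illustration. Let $c$ be the average cost of one hash-table insertion, $N_{\text{cur}}$ the number of insertions the current algorithm performs (symbols of the files that actually get linked plus the defined symbols listed in archive symbol tables), $N_{\text{new}} \ge N_{\text{cur}}$ the number the proposed first phase performs (every symbol, defined or undefined, of every file and archive member), $P$ the number of threads, and $r$ the cost of the resolution logic itself, which is serial in both schemes. Assuming the first phase scales linearly with $P$:

```latex
\begin{align*}
T_{\text{cur}} &\approx c\,N_{\text{cur}} + r, &
T_{\text{new}} &\approx \frac{c\,N_{\text{new}}}{P} + r, \\
T_{\text{new}} < T_{\text{cur}}
  &\iff \frac{c\,N_{\text{new}}}{P} < c\,N_{\text{cur}}
  \iff P > \frac{N_{\text{new}}}{N_{\text{cur}}}.
\end{align*}
```

Under these assumptions, if unused archive members and undefined symbols were to double the number of insertions, the new scheme would need more than two threads to come out ahead. Lock contention and memory bandwidth would push the real threshold higher, which is why measuring it is still necessary.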