Rui Ueyama via llvm-dev
2019-Apr-23 01:49 UTC
[llvm-dev] [RFC] lld: mostly-concurrent symbol resolution
On Tue, Apr 23, 2019 at 3:53 AM David Blaikie <dblaikie at gmail.com> wrote:

> Sounds pretty good to me. An interesting case that occurs to me is
> situations where there are many libraries specified, but only a few are
> used. In the current scheme, the scalability of the resolution
> algorithm isn't dependent on the total number of libraries, is it? (It
> stops once all symbols are resolved - potentially not visiting many
> unused libraries)

The cost of the current algorithm is actually linear in the number of libraries, because we insert all symbols from the archive files' symbol tables into the hash table (so that, when we see an undefined symbol, we can tell whether some archive defines a symbol of the same name).

That being said, the proposed new algorithm would do a bit more work than the current one. In the new algorithm, we insert all symbols, including undefined ones, into the symbol table, whereas an archive file's symbol table contains only defined symbols. So, in theory, if we have a lot of object files in archives that end up not being used, and if we run the linker on a machine with a small number of cores, the new algorithm could be slower. I'd expect that the break-even point is not that high, so I believe this is not a problem in practice, but that's something I'd like to verify by experimenting.

> I wonder if there are cases where many unused libraries are specified
> (perhaps in a situation where one project builds multiple executables
> and some of those executables only use a few external libraries, while
> others use many - but all of the external libraries are written once
> in a list of external dependencies the project as a whole depends on -
> or I guess a situation where there are two executables in a project,
> one uses external libraries A, the other uses B, but both just get a
> link line that specifies A and B) - which would result in this
> strategy doing potentially significantly more work than the current
> implementation.
>
> But it'll be great to see the data no doubt - can gauge how much this
> costs (how many parallel threads is the break-even point for this
> separation of work, how many unused libraries would be needed to
> overwhelm the benefits, etc).
>
> - Dave
>
> On Mon, Apr 22, 2019 at 1:53 AM Rui Ueyama via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> >
> > Hi all,
> >
> > This is a design change proposal to make lld's symbol resolution phase mostly-concurrent without changing the existing semantics. The aim of this change is to speed up the linker on multi-core machines.
> >
> > Background:
> >
> > Even though many parts of lld are multi-threaded, the symbol resolution phase is single-threaded. In the symbol resolution phase, the linker does the following:
> >
> > - Read one symbol at a time from each input file and insert it into a hash table.
> > - If we find an undefined symbol that can be resolved by an object file in a static archive file, immediately pull that file out of the archive (which may transitively pull out other files from other archives).
> >
> > The output of this phase is a set of symbols merged by name and a set of object files to be included in the output file.
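[Editor's note: to make the background above concrete, here is a minimal sketch of the current single-threaded scheme. All names and types (`Symbol`, `InputFile`, `addFile`, and so on) are illustrative stand-ins, not lld's actual classes, and the sketch collapses some states the real linker distinguishes, such as "referenced but undefined" versus "never seen".]

```cpp
// Sketch of today's single-threaded resolution: one global hash table,
// one lookup per symbol string, and immediate fetching of archive members.
#include <string>
#include <unordered_map>
#include <vector>

struct InputFile;

struct Symbol {
  enum KindTy { Unseen, Defined, Lazy } Kind = Unseen;
  InputFile *File = nullptr; // defining file, or archive member for Lazy
};

struct InputSym { std::string Name; bool IsDefined; };
struct InputFile { std::vector<InputSym> Syms; };

std::unordered_map<std::string, Symbol> SymTab;

void addFile(InputFile &F); // forward declaration

// An undefined reference hit a lazy archive symbol: pull the member out.
void fetch(Symbol &S) {
  InputFile *Member = S.File;
  S.Kind = Symbol::Defined;
  addFile(*Member); // the member's own undefineds may fetch more members
}

// Reading an archive's symbol table only records Lazy entries.
void addArchiveSymbol(const std::string &Name, InputFile *Member) {
  Symbol &S = SymTab[Name]; // hash lookup
  if (S.Kind == Symbol::Unseen) {
    S.Kind = Symbol::Lazy;
    S.File = Member;
  }
}

// Reading an object file inserts every symbol, defined or undefined.
void addFile(InputFile &F) {
  for (InputSym &IS : F.Syms) {
    Symbol &S = SymTab[IS.Name]; // hash lookup per (often long) name
    if (IS.IsDefined) {
      if (S.Kind != Symbol::Defined) {
        S.Kind = Symbol::Defined;
        S.File = &F;
      }
    } else if (S.Kind == Symbol::Lazy) {
      fetch(S);
    }
  }
}
```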
> > We couldn't use threads in this phase because it was hard to guarantee the determinism of the link result. To see why, assume that more than one archive file defines the same symbol (which is not an odd assumption, as it is common practice to write your own libc functions and override the standard ones by placing your file before `-lc` on the command line, for example). If two or more threads simultaneously read input files, it is hard to guarantee the link order. To keep things simple, we chose not to use multi-threading at all in this phase.
> >
> > This phase takes roughly 1/3 of the total execution time for typical programs.
> >
> > Analysis:
> >
> > Doing something for every symbol is not necessarily slow, even if the number of symbols is large. lld has many loops that iterate over all symbol objects (lld's internal representation of symbols), and those loops don't take too much time. What is actually slow when reading input files is inserting symbol strings into an in-memory hash table. This is particularly bad when linking large programs. Large programs tend to be written in C++, and C++ symbol names are particularly long due to name mangling. Inserting hundreds of thousands of long strings into a hash table is computationally intensive work.
> >
> > Proposal:
> >
> > I propose we split the symbol resolution phase into two phases and use multi-threading in the first phase. In the first phase, we visit all files, including ones in archive files, to insert all symbol strings into sharded hash tables. For each symbol string, we insert it into a symbol table with a placeholder symbol object as a value.
> >
> > Each file contains a member `std::vector<Symbol *> Symbols`. Once the first phase is done, file A's N'th symbol and file B's M'th symbol have the same name if and only if `A.Symbols[N] == B.Symbols[M]`. That means symbols are merged by name, but "symbol resolution" in the regular linker's sense is not done yet. This phase is highly parallelizable.
> >
> > In the second phase, we visit each input file serially and do name resolution just as we do now. The only difference is that, for each symbol, instead of looking up a symbol table with a symbol string, we just dereference a pointer to obtain the corresponding symbol object, which should be extremely fast. Since the symbol resolution algorithm is still single-threaded and doesn't change, the output remains the same -- it is deterministic, and if two files define the same symbol, the first file is chosen.
> >
> > One caveat is that with this scheme we effectively ignore archive files' symbol tables. We instead read symbols directly from archive members. This shouldn't change the semantics, because an archive file's symbol table should be consistent with its member files. But it may be perceived as odd behavior, because lld would work "correctly" even if an archive file's symbol table is inconsistent, corrupted, or absent.
> >
> > Implementation details:
> >
> > I don't expect too much change, but it looks like I need to move the symbol resolution code (which calls replaceSymbol to replace symbols) from SymbolTable.cpp to somewhere else (perhaps Symbols.cpp), because in the second phase we can determine whether two symbols have the same name without asking the symbol table. Even in the current architecture, they don't really have to belong to the symbol table, so refactoring this seems generally a good idea. I'd do this first, before implementing this proposal.
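[Editor's note: a minimal sketch of how the two phases described above could fit together, again with illustrative, simplified types; the sharding factor, per-shard locking, and one-thread-per-file parallelism are assumptions for brevity, not the proposed implementation.]

```cpp
// Phase 1 interns every symbol string, in parallel, into sharded hash
// tables; phase 2 resolves serially via pointer identity.
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct Symbol {}; // placeholder object; resolution state is added later

constexpr size_t NumShards = 16;

struct Shard {
  std::mutex Mu;
  std::unordered_map<std::string, Symbol *> Map;
};
Shard Shards[NumShards];

// Return the unique placeholder for a name, creating it on first sight.
Symbol *intern(const std::string &Name) {
  Shard &Sh = Shards[std::hash<std::string>{}(Name) % NumShards];
  std::lock_guard<std::mutex> Lock(Sh.Mu);
  auto [It, Inserted] = Sh.Map.try_emplace(Name, nullptr);
  if (Inserted)
    It->second = new Symbol;
  return It->second;
}

struct InputFile {
  std::vector<std::string> SymNames; // all names, defined and undefined
  std::vector<Symbol *> Symbols;     // filled in by phase 1
};

// Phase 1 (parallel): after this, A.Symbols[N] == B.Symbols[M] iff the
// two names are equal.
void phase1(std::vector<InputFile *> &Files) {
  std::vector<std::thread> Workers; // one thread per file for brevity;
  for (InputFile *F : Files)        // a real linker would use a pool
    Workers.emplace_back([F] {
      for (const std::string &Name : F->SymNames)
        F->Symbols.push_back(intern(Name));
    });
  for (std::thread &T : Workers)
    T.join();
}

// Phase 2 (serial, hence deterministic): today's resolution rules run
// unchanged, but identify symbols by pointer instead of by name lookup.
void phase2(std::vector<InputFile *> &Files) {
  for (InputFile *F : Files)
    for (Symbol *S : F->Symbols)
      (void)S; // apply the existing resolution rules to S; omitted here
}
```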
> >
> > Rui
Peter Smith via llvm-dev
2019-Apr-23 09:11 UTC
[llvm-dev] [RFC] lld: mostly-concurrent symbol resolution
On Tue, 23 Apr 2019 at 02:50, Rui Ueyama via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> On Tue, Apr 23, 2019 at 3:53 AM David Blaikie <dblaikie at gmail.com> wrote:
>>
>> [...]
>
> The cost of the current algorithm is actually linear in the number of
> libraries, because we insert all symbols from the archive files' symbol
> tables into the hash table (so that, when we see an undefined symbol,
> we can tell whether some archive defines a symbol of the same name).
>
> That being said, the proposed new algorithm would do a bit more work
> than the current one. [...] So, in theory, if we have a lot of object
> files in archives that end up not being used, and if we run the linker
> on a machine with a small number of cores, the new algorithm could be
> slower. I'd expect that the break-even point is not that high, so I
> believe this is not a problem in practice, but that's something I'd
> like to verify by experimenting.

I think that a more likely occurrence than unused libraries is sparse use of members in libraries. For example, a large utility library could be included for the use of only one or two members. Although the computational overhead may be small, the extra file access time may dominate in those cases, especially on a first link. Finding a representative set of programs will be interesting.

>> [...]
>> > One caveat is that with this scheme we effectively ignore archive
>> > files' symbol tables. We instead read symbols directly from archive
>> > members. This shouldn't change the semantics, because an archive
>> > file's symbol table should be consistent with its member files. But
>> > it may be perceived as odd behavior, because lld would work
>> > "correctly" even if an archive file's symbol table is inconsistent,
>> > corrupted, or absent.

If I've understood correctly, you are proposing a pre-pass on the symbol tables of all the objects that produces a fast lookup from "symbol name" to "symbol placeholder".
I think this could work if all it is used for is to compare names quickly; it might cause differences in behaviour if we tried to use any more information about the symbols from the first pass. For example:

- There can be more than one object in a static library with a definition of a symbol, and it is not guaranteed that the first object in the archive symbol table is selected in all cases (a later object can be loaded first due to a reference to some other symbol definition).
- Weak definitions aren't taken into account when choosing a library member (the archive symbol table has no concept of weakness).

Peter
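[Editor's note: Peter's two points boil down to this: the phase-1 name map carries no binding information, so rules like the following have to stay in the serial pass, applied in command-line order. A simplified sketch of conventional resolution precedence; illustrative only, and incomplete (duplicate strong definitions, which are an error, are omitted).]

```cpp
// Simplified resolution precedence, applied serially in input order.
// The phase-1 placeholder map knows none of this; it only interns names.
enum class Binding { Unseen, Lazy, Weak, Strong };

struct Symbol { Binding Bind = Binding::Unseen; };

void fetchArchiveMember(Symbol &S) {
  (void)S; // would load the defining member and add its symbols
}

void resolve(Symbol &S, Binding New) {
  switch (New) {
  case Binding::Strong:      // a strong definition wins; it also
    S.Bind = Binding::Strong; // overrides an earlier weak one
    break;
  case Binding::Weak:        // a weak def never fetches an archive member
    if (S.Bind != Binding::Strong)
      S.Bind = Binding::Weak;
    break;
  case Binding::Lazy:        // archive member defining S, not yet loaded
    if (S.Bind == Binding::Unseen)
      S.Bind = Binding::Lazy;
    break;
  case Binding::Unseen:      // stands in for an undefined reference to S
    if (S.Bind == Binding::Lazy)
      fetchArchiveMember(S); // only a real reference loads the member
    break;
  }
}
```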
Rui Ueyama via llvm-dev
2019-Apr-23 09:59 UTC
[llvm-dev] [RFC] lld: mostly-concurrent symbol resolution
On Tue, Apr 23, 2019 at 6:12 PM Peter Smith <peter.smith at linaro.org> wrote:
>
> [...]
>
> I think that a more likely occurrence than unused libraries is sparse
> use of members in libraries. For example a large utility library could
> be included for the use of only 1 or 2 members. Although the
> computational overhead may be small the extra file access time may
> dominate in those cases, especially on first link. Finding a
> representative set of programs will be interesting.

An archive file's symbol table is at the beginning of the file, so parsing it indeed has better locality than parsing the symbol table of each archive member. The performance characteristics would be greatly affected by those of the file system: if an archive file has just been created on the same machine, the entire file is likely in the memory cache, and random access would be very fast; if it is on a cold hard drive, the sparse access pattern would be penalized.

On the other hand, there's an interesting optimization that becomes possible only by directly parsing each member's symbol table. Currently, we insert all symbols from an archive file's symbol table into the hash table, and when we pull a file out of an archive, we re-insert the symbols read from that object file. So, for each defined symbol in an archive member, we insert the same string into the hash table twice. This is unavoidable: even if we know that an archive contains an object file defining symbol Foo, we don't know which symbol table entry of that object file defines Foo, so we have to look the symbol strings up in the hash table again to find out. If we use the member files' own symbol tables from the start, we don't have to look up the hash table twice, because we already know the symbol object for each symbol table entry. The number of defined symbols in archive files is not small for typical programs, so, depending on the usage pattern, the new scheme could run faster even on a single-core machine.
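[Editor's note: a sketch of the double-insertion cost described above and how the proposed scheme avoids it; names are illustrative.]

```cpp
// Today: one hash lookup when reading the archive's symbol table, and a
// second lookup for the same string when the member is actually loaded.
#include <string>
#include <unordered_map>
#include <vector>

struct Symbol { bool Defined = false; };
std::unordered_map<std::string, Symbol *> SymTab;

// Current scheme: the archive symbol table gave us (Name -> Member), but
// after loading Member we must hash each of its defined names again to
// find the Symbol objects they correspond to.
void loadMemberToday(const std::vector<std::string> &DefinedNames) {
  for (const std::string &Name : DefinedNames) {
    Symbol *&S = SymTab[Name]; // second lookup for the same string
    if (!S)
      S = new Symbol;
    S->Defined = true;
  }
}

// Proposed scheme: phase 1 already interned the member's symbol strings
// and stored a Symbol* per entry, so loading it is pointer chasing only.
void loadMemberProposed(const std::vector<Symbol *> &MemberSymbols) {
  for (Symbol *S : MemberSymbols)
    S->Defined = true; // no hashing of long mangled C++ names
}
```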
I don't know which outperforms the other. It'll be great to see the benchmark results once implemented.

> [...]
> If I've understood correctly you are proposing a pre-pass on the
> symbol tables of all the objects that will produce a fast lookup of
> "symbol name" to "symbol placeholder". I think that this could work if
> all it is used for is to compare names quickly [...]

Right. The first phase exists only to compare names quickly in the second phase.
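[Editor's note: in other words, reusing the illustrative types from the earlier sketches, the only thing phase 1 guarantees is that a name comparison becomes a pointer comparison.]

```cpp
#include <cstddef>
#include <vector>

struct Symbol; // interned placeholder produced by the first phase

// After phase 1, "file A's N'th symbol has the same name as file B's
// M'th symbol" is a single pointer comparison; no strings are touched.
bool sameName(const std::vector<Symbol *> &ASymbols, size_t N,
              const std::vector<Symbol *> &BSymbols, size_t M) {
  return ASymbols[N] == BSymbols[M]; // A.Symbols[N] == B.Symbols[M]
}
```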