Hi James,

On 05.11.2020 17:59, James Henderson wrote:
> (Resending with history trimmed to avoid it getting stuck in moderator
> queue).
>
> Hi Alexey,
>
> Just an update - I identified the cause of the "Generated debug info
> is broken" error message when I tried to build things locally: the
> `outStreamer` instance is initialised with the host Triple, instead of
> whatever the target's triple is. For example, I build and run LLD on
> Windows, which means that a Windows triple will be generated, and
> consequently a COFF-emitting streamer will be created, rather than the
> ELF-emitting one I'd expect were the triple information to somehow be
> derived from the linker flavor/input objects etc. Hard-coding in my
> target triple resolved the issue (although I still got the other
> warnings mentioned from my game link).

Thank you for the details. Actually, I did not test this on Windows.
I will do so and update the patch.

> I measured the performance figures using LLD patched as described, and
> using the same methodology as my earlier results, and got the following:
>
> Link-time speed (s):
> +-----------------------------+---------------+
> | Package variant             | GC 1 (normal) |
> +-----------------------------+---------------+
> | Game (DWARF linker)         |          53.6 |
> | Game (DWARF linker, no ODR) |          63.6 |
> | Clang (DWARF linker)        |         200.6 |
> +-----------------------------+---------------+
>
> Output size - Game package (MB):
> +-----------------------------+------+
> | Category                    | GC 1 |
> +-----------------------------+------+
> | DWARFLinker (total)         |  696 |
> | DWARFLinker (DWARF*)        |  429 |
> | DWARFLinker (other)         |  267 |
> | DWARFLinker no ODR (total)  |  753 |
> | DWARFLinker no ODR (DWARF*) |  485 |
> | DWARFLinker no ODR (other)  |  268 |
> +-----------------------------+------+
>
> Output size - Clang (MB):
> +-----------------------------+------+
> | Category                    | GC 1 |
> +-----------------------------+------+
> | DWARFLinker (total)         | 1294 |
> | DWARFLinker (DWARF*)        |  743 |
> | DWARFLinker (other)         |  551 |
> | DWARFLinker no ODR (total)  | 1294 |
> | DWARFLinker no ODR (DWARF*) |  743 |
> | DWARFLinker no ODR (other)  |  551 |
> +-----------------------------+------+
>
> *DWARF = just .debug_info, .debug_line, .debug_loc, .debug_aranges,
> .debug_ranges.
>
> Peak Working Set Memory usage (GB):
> +-----------------------------+------+
> | Package variant             | GC 1 |
> +-----------------------------+------+
> | Game (DWARFLinker)          |  5.7 |
> | Game (DWARFLinker, no ODR)  |  5.8 |
> | Clang (DWARFLinker)         | 22.4 |
> | Clang (DWARFLinker, no ODR) | 22.5 |
> +-----------------------------+------+
>
> My opinion is that the time costs of the DWARF Linker approach are not
> really practical except on build servers, in the current state of
> affairs for larger packages: clang takes 8.8x as long as the
> fragmented approach and 11.2x as long as the plain approach (without
> the no ODR option). The size saving is certainly good, with my version
> of clang 51% of the total output size for the DWARF linker approach
> versus the plain approach and 55% of the fragmented approach (though
> it is likely that further size savings might be possible for the
> latter). The game produced reasonable size savings too: 62% and 74%,
> but I'd be surprised if these gains would be enough for people to want
> to use the approach in day-to-day situations, which presumably is the
> main use-case for smaller DWARF, due to improved debugger load times.
>
> Interesting to note is that the GCC 7.5 build of clang I've used these
> figures with produced no difference in size results between the two
> variants, unlike other packages. Consequently, a significant amount of
> time is saved for no penalty.
>
> I'll be interested to see what the time results of the DWARF linker
> are once further improvements to it have been made.

Yep, the current time costs of the DWARFLinker are too high.
One of the reasons is that lld handles sections in parallel, while
DWARFLinker handles data sequentially. The DWARFLinker numbers could
probably be improved if it were taught to handle data in parallel.
Thank you for the comparison!

Speaking of the "Fragmented DWARF" solution, how do you estimate the
memory requirements for supporting fragmented object files? In the
comments on your Lightning Talk you mentioned that it would be
necessary to "update DebugInfo library to treat the fragmented sections
as one continuous section". Do you think it would be cheap to
implement?

Thank you, Alexey.

> Thanks,
>
> James
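
The triple problem described in the quoted message can be pictured with a small stand-alone sketch. This is not LLD or LLVM code: `ObjectFormat`, `defaultFormatFor` and `streamerTriple` are hypothetical names, and the substring matching is a deliberately crude stand-in for what a real triple parser does. It only illustrates why a streamer configured from a Windows host triple ends up emitting COFF even when the inputs are ELF.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical, heavily simplified model of the object format a triple
// implies. A real triple parser handles many more cases than this.
enum class ObjectFormat { ELF, COFF, MachO };

ObjectFormat defaultFormatFor(std::string_view Triple) {
  // Triples look like arch-vendor-os[-environment]; this sketch only looks
  // for a few well-known OS names anywhere in the string.
  if (Triple.find("windows") != std::string_view::npos ||
      Triple.find("win32") != std::string_view::npos)
    return ObjectFormat::COFF;
  if (Triple.find("darwin") != std::string_view::npos ||
      Triple.find("macos") != std::string_view::npos)
    return ObjectFormat::MachO;
  return ObjectFormat::ELF;
}

// Choose the triple used to set up the output streamer: prefer the triple
// derived from the link target, and fall back to the host triple only when
// no target triple is known.
std::string streamerTriple(std::string_view TargetTriple,
                           std::string_view HostTriple) {
  return std::string(TargetTriple.empty() ? HostTriple : TargetTriple);
}
```

With a Windows host and a Linux target, `defaultFormatFor(streamerTriple(target, host))` selects ELF in this model; passing an empty target falls back to the host triple and selects COFF, which mirrors the failure described above.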
Hi Alexey,

On Thu, 5 Nov 2020 at 21:02, Alexey Lapshin <avl.lapshin at gmail.com> wrote:
> Yep, the current time costs of the DWARFLinker are too high. One of
> the reasons is that lld handles sections in parallel, while
> DWARFLinker handles data sequentially. The DWARFLinker numbers could
> probably be improved if it were taught to handle data in parallel.
> Thank you for the comparison!

No problem! It was useful for me to gather the numbers for internal
investigations too. Parallelisation would hopefully help, but at this
point it's hard to say by how much. There are likely going to be
additional time costs for fragmented DWARF too, once I fix the
remaining deficiencies, as they'll require more relocations.

> Speaking of the "Fragmented DWARF" solution, how do you estimate the
> memory requirements for supporting fragmented object files?

I'm not sure if you're referring to the memory usage at link time or
the disk space required for the inputs, but I posted both those figures
in my original post in this thread. If it's something else, please let
me know. Based on those figures, it's clear the cost depends on the
input code base, but it was between 25 and 75% or so bigger object file
size and 50 and 100% more memory usage. Again, these are likely both to
go up when I get around to fixing the remaining issues.

> In the comments on your Lightning Talk you mentioned that it would be
> necessary to "update DebugInfo library to treat the fragmented
> sections as one continuous section". Do you think it would be cheap
> to implement?

I think so. I'd hope it would be possible to replace the data buffer
underlying the DWARF section parsing to be able to "jump" to the next
fragment (section) when it gets to the end of the previous one. I
haven't experimented with this, but I wouldn't expect it to be costly
in terms of code quality or performance, at least in comparison to
parsing the DWARF itself.
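
The "jump to the next fragment" idea for sequential parsing can be sketched as a cursor over a list of buffers. This is a minimal illustration under stated assumptions, not code from LLVM's DebugInfo library: `FragmentCursor` is a hypothetical class, and `std::string_view` stands in for whatever buffer type the real parser uses. The ULEB128 decoder is included only to show a value being read across a fragment boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical sequential reader over one logical section that is stored
// as several fragments.
class FragmentCursor {
public:
  explicit FragmentCursor(std::vector<std::string_view> Frags)
      : Fragments(std::move(Frags)) {
    skipExhausted(); // step over any leading empty fragments
  }

  // Returns the next byte, or std::nullopt once every fragment is consumed.
  std::optional<std::uint8_t> readByte() {
    if (Index == Fragments.size())
      return std::nullopt;
    auto Byte = static_cast<std::uint8_t>(Fragments[Index][Pos++]);
    skipExhausted(); // "jump" to the next fragment when this one ends
    return Byte;
  }

  // Decodes a ULEB128 value; the encoding may straddle fragments.
  std::optional<std::uint64_t> readULEB128() {
    std::uint64_t Value = 0;
    unsigned Shift = 0;
    while (true) {
      std::optional<std::uint8_t> Byte = readByte();
      if (!Byte || Shift >= 64)
        return std::nullopt; // truncated or over-long encoding
      Value |= static_cast<std::uint64_t>(*Byte & 0x7f) << Shift;
      if ((*Byte & 0x80) == 0)
        return Value;
      Shift += 7;
    }
  }

private:
  // Advances past fragments that have no unread bytes left.
  void skipExhausted() {
    while (Index < Fragments.size() && Pos == Fragments[Index].size()) {
      ++Index;
      Pos = 0;
    }
  }

  std::vector<std::string_view> Fragments;
  std::size_t Index = 0; // fragment currently being read
  std::size_t Pos = 0;   // read position inside that fragment
};
```

The parser built on top of such a cursor never sees the fragment boundaries, which is the property the reply above hopes for. The sketch only covers forward reads; seeking to an arbitrary offset is a separate problem.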
On Fri, Nov 6, 2020 at 2:32 AM James Henderson via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> Speaking of the "Fragmented DWARF" solution, how do you estimate the
>> memory requirements for supporting fragmented object files?
>
> I'm not sure if you're referring to the memory usage at link time or
> the disk space required for the inputs, but I posted both those
> figures in my original post in this thread. If it's something else,
> please let me know. Based on those figures, it's clear the cost
> depends on the input code base, but it was between 25 and 75% or so
> bigger object file size and 50 and 100% more memory usage. Again,
> these are likely both to go up when I get around to fixing the
> remaining issues.
>
>> In the comments on your Lightning Talk you mentioned that it would be
>> necessary to "update DebugInfo library to treat the fragmented
>> sections as one continuous section". Do you think it would be cheap
>> to implement?
>
> I think so. I'd hope it would be possible to replace the data buffer
> underlying the DWARF section parsing to be able to "jump" to the next
> fragment (section) when it gets to the end of the previous one. I
> haven't experimented with this, but I wouldn't expect it to be costly
> in terms of code quality or performance, at least in comparison to
> parsing the DWARF itself.

sizeof(InputSection) is 208 (sizeof(Elf64_Shdr) is 64), so there is
indeed a significant overhead for fragmented sections. A
MergeInputSection can be split into SectionPieces, which are indeed
lightweight, and MarkLive can mark liveness on these pieces. However,
in InputFiles.cpp we change a MergeInputSection to a regular section if
it has a relocation (toRegularSection). Using more lightweight data
structures for .debug_* fragments is still challenging.

--
宋方睿
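
To put the two sizes quoted above in perspective, here is a rough back-of-the-envelope sketch. Only the figures 208 and 64 come from the message; the fragment count used in the usage note is purely hypothetical, and `Piece` is an illustrative record loosely inspired by the idea of a lightweight piece, not lld's actual data structure.

```cpp
#include <cstddef>
#include <cstdint>

// Sizes quoted in the message above.
constexpr std::uint64_t InputSectionBytes = 208;    // sizeof(InputSection)
constexpr std::uint64_t ElfSectionHeaderBytes = 64; // sizeof(Elf64_Shdr)

// Hypothetical lightweight per-piece record. The field layout is an
// assumption made for this sketch only.
struct Piece {
  std::uint32_t InputOff;  // offset of the piece inside its section
  std::uint32_t Live : 1;  // liveness bit a mark phase could set
  std::uint32_t Hash : 31; // truncated content hash
  std::uint64_t OutputOff; // offset of the piece in the output
};

// Fixed bookkeeping cost for Count records of PerRecord bytes each,
// ignoring the section contents themselves.
constexpr std::uint64_t bookkeepingBytes(std::uint64_t Count,
                                         std::uint64_t PerRecord) {
  return Count * PerRecord;
}
```

For a hypothetical link with one million debug fragments, `bookkeepingBytes(1000000, InputSectionBytes)` comes to 208,000,000 bytes of linker-side records, on top of 64,000,000 bytes of section headers in the inputs. A record the size of `Piece` would cut the first figure considerably, which is why the restriction on sections with relocations matters.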
On 06.11.2020 13:32, James Henderson wrote:
>> Speaking of the "Fragmented DWARF" solution, how do you estimate the
>> memory requirements for supporting fragmented object files?
>
> I'm not sure if you're referring to the memory usage at link time or
> the disk space required for the inputs, but I posted both those
> figures in my original post in this thread.

I mean the run-time memory usage of the DebugInfoDWARF library.
Currently, when an object file is loaded and a DWARFContext is created,
the DWARFContext references section data from object::ObjectFile:

    DWARFContext(std::unique_ptr<const DWARFObject> DObj, ...)
    DWARFObjInMemory(const object::ObjectFile &Obj, ...)

    class DWARFObjInMemory {
      const DWARFSection &getLocSection() const;
      const DWARFSection &getLoclistsSection() const;
      StringRef getArangesSection() const;
      const DWARFSection &getFrameSection() const;
      const DWARFSection &getEHFrameSection() const;
      const DWARFSection &getLineSection() const;
      StringRef getLineStrSection() const;
    };

    class DWARFUnit {
      DWARFContext &Context;
      /// Section containing this DWARFUnit.
      const DWARFSection &InfoSection;
    };

    struct DWARFSection {
      StringRef Data;
    };

DWARFSection references data loaded by the object file, and it is
assumed to be a monolithic piece of data. There is code that uses this
data assuming random access:

    StringRef LineData = OrigDwarf.getDWARFObj().getLineSection().Data;
    LineData.slice(*StmtList + 4, PrologueEnd)
    ...
    StringRef FrameData = OrigDwarf.getDWARFObj().getFrameSection().Data;
    FrameData.substr(EntryOffset, InitialLength + 4)
    ...
    InputSec = Dwarf.getDWARFObj().getLocSection();
    InputSec.Data.substr(Offset, Length);
    ...
    DWARFDataExtractor RangesData(Context.getDWARFObj(), *RangeSection,
                                  isLittleEndian, getAddressByteSize());
    uint64_t ActualRangeListOffset = RangeSectionBase + RangeListOffset;
    RangeList.extract(RangesData, &ActualRangeListOffset);

I.e., it is possible to access an arbitrary piece of a DWARFSection. If
object::ObjectFile contained fragmented sections, we would need a
solution for how that could work. One possibility is to create a glued
copy of the fragmented data and pass it to the DWARFObj. But that would
require holding all the original debug info sections in memory twice
(the fragmented sections inside the ObjectFile and the glued sections
inside the DWARFObj). Another possibility is to rewrite
DebugInfoDWARF/DWARFSection to avoid random access to the data (if that
is possible).

> If it's something else, please let me know. Based on those figures,
> it's clear the cost depends on the input code base, but it was between
> 25 and 75% or so bigger object file size and 50 and 100% more memory
> usage. Again, these are likely both to go up when I get around to
> fixing the remaining issues.
>
>> In the comments on your Lightning Talk you mentioned that it would be
>> necessary to "update DebugInfo library to treat the fragmented
>> sections as one continuous section". Do you think it would be cheap
>> to implement?
>
> I think so. I'd hope it would be possible to replace the data buffer
> underlying the DWARF section parsing to be able to "jump" to the next
> fragment (section) when it gets to the end of the previous one. I
> haven't experimented with this, but I wouldn't expect it to be costly
> in terms of code quality or performance, at least in comparison to
> parsing the DWARF itself.

So it looks like you assume the second case: avoiding random access to
the section data.
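
The trade-off in the message above can be made concrete with a small stand-alone sketch. It is an illustration under stated assumptions rather than a proposal from the thread: `FragmentedSection` is a hypothetical class, `std::string_view` and `std::string` stand in for `StringRef` and owned buffers, and `glued()` corresponds to the "glued copy" option. For contrast, `substr()` shows what serving a random-access request directly from the fragments would involve: translating a logical offset into a fragment index plus a local offset, and copying only the requested range when it crosses fragments.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical view of one logical section stored as several fragments.
class FragmentedSection {
public:
  explicit FragmentedSection(std::vector<std::string_view> Frags)
      : Fragments(std::move(Frags)) {
    std::size_t End = 0;
    for (std::string_view F : Fragments) {
      End += F.size();
      Ends.push_back(End); // logical offset one past the end of fragment i
    }
  }

  std::size_t size() const { return Ends.empty() ? 0 : Ends.back(); }

  // "Glued copy": one contiguous buffer, at the cost of duplicating all
  // of the section data in memory.
  std::string glued() const {
    std::string Out;
    Out.reserve(size());
    for (std::string_view F : Fragments)
      Out.append(F);
    return Out;
  }

  // Random access without a full copy: returns the bytes in
  // [Offset, Offset + Length), clamped to the section size.
  std::string substr(std::size_t Offset, std::size_t Length) const {
    std::string Out;
    if (Offset >= size())
      return Out;
    Length = std::min(Length, size() - Offset);
    // Binary search for the first fragment whose end lies beyond Offset.
    auto It = std::upper_bound(Ends.begin(), Ends.end(), Offset);
    std::size_t I = static_cast<std::size_t>(It - Ends.begin());
    std::size_t Local = Offset - (I == 0 ? 0 : Ends[I - 1]);
    while (Length > 0) {
      std::string_view Part = Fragments[I].substr(Local, Length);
      Out.append(Part);
      Length -= Part.size();
      Local = 0; // later fragments are read from their start
      ++I;
    }
    return Out;
  }

private:
  std::vector<std::string_view> Fragments;
  std::vector<std::size_t> Ends; // running totals of fragment sizes
};
```

Each lookup in this sketch costs a binary search over the fragment table, and it returns an owned copy instead of a view into the original buffer, so callers that currently keep a `StringRef` into the section would need changing either way.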