Alexey Lapshin via llvm-dev
2020-Oct-27 08:23 UTC
[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.
On 26.10.2020 22:38, David Blaikie wrote:> > > On Sun, Oct 25, 2020 at 9:31 AM Alexey Lapshin <avl.lapshin at gmail.com > <mailto:avl.lapshin at gmail.com>> wrote: > > > On 23.10.2020 19:43, David Blaikie wrote: >> >>> >>> >>> >>> Ah, yeah - that seems like a missed opportunity - >>> duplicating the whole type DIE. LTO does this by making >>> monolithic types - merging all the members from different >>> definitions of the same type into one, but that's maybe too >>> expensive for dsymutil (might still be interesting to know >>> how much more expensive, etc). But I think the other way to >>> go would be to produce a declaration of the type, with the >>> relevant members - and let the DWARF consumer identify this >>> declaration as matching up with the earlier definition. >>> That's the sort of DWARF you get from the non-MachO default >>> -fno-standalone-debug anyway, so it's already pretty well >>> tested/supported (support in lldb's a bit younger/more >>> work-in-progress, admittedly). I wonder how much dsym size >>> there is that could be reduced by such an implementation. >> >> I see. Yes, that could be done and I think it would result in >> noticeable size reduction(I do not know exact numbers at the >> moment). >> >> I work on multi-thread DWARFLinker now and it`s first version >> will do exactly the same type processing like current dsymutil. >> >> Yeah, best to keep the behavior the same through that >> >> Above scheme could be implemented as a next step and it would >> result in better size reduction(better than current state). >> >> But I think the better scheme could be done also and it would >> result in even bigger size reduction and in faster execution. >> This scheme is something similar to what you`ve described >> above: "LTO does - making monolithic types - merging all the >> members from different definitions of the same type into one". >> >> I believe the reason that's probably not been done is that it >> can't be streamed - it'd lead to buffering more of the output > > yes. The fact that DWARF should be streamed into AsmPrinter > complicates parallel dwarf generation. In my prototype, I generate > several resulting files(each for one source compilation unit) and > then sequentially glue them into the final resulting file. > > How does that help? Do you use relocations in those intermediate > object files so the DWARF in them can refer across files?It does not help with referring across the file. It helps to parallel the generation of CU bodies. It is not possible to write two CUs in parallel into AsmPrinter. To make possible parallel generation I stream them into different AsmPrinters(this comment is for "I believe the reason that's probably not been done is that it can't be streamed". which initially was about referring across the file, but it seems I added another direction).> >> (if two of these expandable types were in one CU - the start of >> the second type couldn't be known until the end because it might >> keep getting pushed later due to expansion of the first type) >> and/or having to revisit all the type references (the offset to >> the second type wouldn't be known until the end - so writing the >> offsets to refer to the type would have to be deferred until then). > > That is the second problem: offsets are not known until the end of > file. > dsymutil already has that situation for inter-CU references, so it > has extra pass to > fixup offsets. > > Oh, it does? I figured it was one-pass, and that it only ever refers > back to types in previous CUs? So it doesn't have to go back and do a > second pass. But I guess if sees a declaration of T1 in CU1, then > later on sees a definition of T1 in CU2, does it somehow go back to > CU1 and remove the declaration/make references refer to the definition > in CU2? I figured it'd just leave the declaration and references to it > as-is, then add the definition and use that from CU2 onwards?For the processing of the types, it do not go back. This "I figured it was one-pass, and that it only ever refers back to types in previous CUs" and this "I figured it'd just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards" are correct.> With multi-thread implementation such situation would arise more > often > for type references and so more offsets should be fixed during > additional pass. > >> DWARFLinker could create additional artificial compile unit >> and put all merged types there. Later patch all type >> references to point into this additional compilation unit. >> No any bits would be duplicated in that case. The performance >> improvement could be achieved due to less amount of the >> copied DWARF and due to the fact that type references could >> be updated when DWARF is cloned(no need in additional pass >> for that). >> >> "later patch all type references to point into this additional >> compilation unit" - that's the additional pass that people are >> probably talking/concerned about. Rewalking all the DWARF. The >> current dsymutil approach, as far as I know, is single pass - it >> knows the final, absolute offset to the type from the moment it >> emits that type/needs to refer to it. > > Right. Current dsymutil approach is single pass. And from that > point of view, solution > which you`ve described(to produce a declaration of the type, with > the relevant members) > allows to keep that single pass implementation. > > But there is a restriction for current dsymutil approach: To > process inter-CU references > it needs to load all DWARF into the memory(While it analyzes which > part of DWARF is live, > it needs to have all CUs loaded into the memory). > > All DWARF for a single file (which for dsymutil is mostly a single CU, > except with LTO I guess?), not all DWARF for all inputs in memory at > once, yeah?right. In dsymutil case - all DWARF for a single file(not all DWARF for all inputs in memory at once). But in llvm-dwarfutil case single file contains DWARF for all original input object files and it all becomes loaded into memory.> That leads to huge memory usage. > It is less important when source is a set of object files(like in > dsymutil case) and this > become a real problem for llvm-dwarfutil utility when source is a > single file(With current > implementation it needs 30G of memory for compiling clang binary). > > Yeah, that's where I think you'd need a fixup pass one way or another > - because cross-CU references can mean that when you figure out a new > layout for CU5 (because it has a duplicate type definition of > something in CU1) then you might have to touch CU4 that had an > absolute/cross-CU forward reference to CU5. Once you've got such a > fixup pass (if dsymutil already has one? Which, like I said, I'm > confused why it would have one/that doesn't match my very vague > understanding) then I think you could make dsymutil work on a per-CU > basis streaming things out, then fixing up a few offsets.When dsymutil deduplicates types it changes local CU reference into inter-CU reference(so that CU2(next) could reference type definition from CU1(prev)). To do this change it does not need to do any fixups currently. When dsymutil meets already existed(located in the input object file) inter-CU reference pointing into the CU which has not been processed yet(and then its offset is unknown) it marks it as "forward reference" and patches later during additional pass "fixup forward references" at a time when offsets are known. If CUs would be processed in parallel their offsets would not be known at the moment when local type reference would be changed into inter-CU reference. So we would need to do the same fix-up processing for all references to the types like we already do for other inter-CU references.> Without loading all CU into the memory it would require two passes > solution. First to analyze > which part of DWARF relates to live code and then second pass to > generate the result. > > Not sure it'd require any more second pass than a "fixup" pass, which > it sounds like you're saying it already has?It looks like it would need an additional pass to process inter-CU references(existed in incoming file) if we do not want to load all CUs into memory. When the input file contains inter-CU references, DWARFLinker needs to follow them while doing liveness marking. i.e. if the original CU has a live part which references another CU we need to follow this new CU and mark the referenced part as life. At the current moment, while doing liveness analysis, we have all CUs in memory. That allows us to load all CUs once and analyze them all. In case llvm-dwarfutil(which loads all DWARF for input file) it leads to huge memory usage. Let's say CU1 references CU100. And CU100 references CU1. We could not start generation for CU1 until we analyzed CU100 and marked the corresponding part of CU1 as life. At the same time, we could not load DWARF for all CUs. Then processing(in simplified form) could look like this: 1: for (CU : CU1...CU100) load CU, do liveness analysis, remember references, unload CU 2: for (all references) load CU, do liveness analysis, unload CU 3: for (CU : CU1...CU100) load CU, clone CU That is a simplified scheme, but I think it is enough to show the idea. In this scheme we have 1 and 2 which should be done before 3. Alexey.> If we would have a two passes solution then we could create a > compilation unit with all > types at first pass and at the second pass we could generate > result with correct offsets(no > need to fix up them as it is currently required by dsymutil for > forward inter-CU references). > The open question currently: how expensive this two passes > approach is. > > Thank you, Alexey. > >> Anyway, that might be the next step after multi-thread >> DWARFLinker would be ready. >> >> Yep, be interesting to see how it all goes! >> >>> >>> Do you suggest that 0x0000011b should be transformed >>> into something like that: >>> >>> 0x000000fc: DW_TAG_compile_unit >>> DW_AT_language (DW_LANG_C_plus_plus) >>> DW_AT_name ("templ.cpp") >>> DW_AT_stmt_list (0x00000090) >>> DW_AT_low_pc (0x0000000100000fa0) >>> DW_AT_high_pc (0x0000000100000fab) >>> >>> 0x0000011b: DW_TAG_structure_type >>> DW_AT_specification (0x0000002a "x") >>> >>> 0x00000124: DW_TAG_subprogram >>> DW_AT_linkage_name ("_ZN1x2f3IiEEiv") >>> DW_AT_name ("f3<int>") >>> DW_AT_type (0x000000000000005e "int") >>> DW_AT_declaration (true) >>> DW_AT_external (true) >>> DW_AT_APPLE_optimized (true) >>> 0x00000138: NULL >>> 0x00000139: NULL >>> >>> 0x00000140: DW_TAG_subprogram >>> DW_AT_low_pc (0x0000000100000fa0) >>> DW_AT_high_pc (0x0000000100000fab) >>> DW_AT_specification (0x0000000000000124 "_ZN1x2f3IiEEiv") >>> 0x00000155: NULL >>> >>> Did I correctly get the idea? >>> >>> >>> Yep, more or less. It'd be "safer" if 11b didn't use >>> DW_AT_specification to refer to 2a, but instead was only a >>> completely independent declaration of "x" - that path is >>> already well supported/tested (well, it's the >>> work-in-progress stuff for lldb to support >>> -fno-standalone-debug, but gdb's been consuming DWARF like >>> this for years, Clang and GCC both produce DWARF like this >>> (if the type is "homed" in another file, then Clang/GCC >>> produce DWARF that emits a declaration with just the members >>> needed to define any member functions >>> defined/inlined/referenced in this CU)) for years. >>> >>> But using DW_AT_specification, or maybe some other extension >>> attribute might make the consumers task a bit easier (could >>> do both - use an extension attribute to tie them up, leave >>> DW_AT_declaration/DW_AT_name here for consumers that don't >>> understand the extension attribute) in finding that they're >>> all the same type/pieces of teh same type. >> >> yes. would try this solution. >> >> Thank you, Alexey. >> >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20201027/74385aab/attachment.html>
David Blaikie via llvm-dev
2020-Oct-27 17:32 UTC
[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.
On Tue, Oct 27, 2020 at 1:23 AM Alexey Lapshin <avl.lapshin at gmail.com> wrote:> > On 26.10.2020 22:38, David Blaikie wrote: > > > > On Sun, Oct 25, 2020 at 9:31 AM Alexey Lapshin <avl.lapshin at gmail.com> > wrote: > >> >> On 23.10.2020 19:43, David Blaikie wrote: >> >> >>>> >>>> >>> Ah, yeah - that seems like a missed opportunity - duplicating the whole >>> type DIE. LTO does this by making monolithic types - merging all the >>> members from different definitions of the same type into one, but that's >>> maybe too expensive for dsymutil (might still be interesting to know how >>> much more expensive, etc). But I think the other way to go would be to >>> produce a declaration of the type, with the relevant members - and let the >>> DWARF consumer identify this declaration as matching up with the earlier >>> definition. That's the sort of DWARF you get from the non-MachO default >>> -fno-standalone-debug anyway, so it's already pretty well tested/supported >>> (support in lldb's a bit younger/more work-in-progress, admittedly). I >>> wonder how much dsym size there is that could be reduced by such an >>> implementation. >>> >>> I see. Yes, that could be done and I think it would result in noticeable >>> size reduction(I do not know exact numbers at the moment). >>> >>> I work on multi-thread DWARFLinker now and it`s first version will do >>> exactly the same type processing like current dsymutil. >>> >> Yeah, best to keep the behavior the same through that >> >>> Above scheme could be implemented as a next step and it would result in >>> better size reduction(better than current state). >>> >>> But I think the better scheme could be done also and it would result in >>> even bigger size reduction and in faster execution. This scheme is >>> something similar to what you`ve described above: "LTO does - making >>> monolithic types - merging all the members from different definitions of >>> the same type into one". >>> >> I believe the reason that's probably not been done is that it can't be >> streamed - it'd lead to buffering more of the output >> >> yes. The fact that DWARF should be streamed into AsmPrinter complicates >> parallel dwarf generation. In my prototype, I generate >> several resulting files(each for one source compilation unit) and then >> sequentially glue them into the final resulting file. >> > How does that help? Do you use relocations in those intermediate object > files so the DWARF in them can refer across files? > > It does not help with referring across the file. It helps to parallel the > generation of CU bodies. > It is not possible to write two CUs in parallel into AsmPrinter. To make > possible parallel generation I stream them into different AsmPrinters(this > comment is for "I believe the reason that's probably not been done is that > it can't be streamed". which initially was about referring across the file, > but it seems I added another direction). >Oh, I see - thanks for explaining, essentially buffering on-disk.> >> (if two of these expandable types were in one CU - the start of the >> second type couldn't be known until the end because it might keep getting >> pushed later due to expansion of the first type) and/or having to revisit >> all the type references (the offset to the second type wouldn't be known >> until the end - so writing the offsets to refer to the type would have to >> be deferred until then). >> >> That is the second problem: offsets are not known until the end of file. >> dsymutil already has that situation for inter-CU references, so it has >> extra pass to >> fixup offsets. >> > Oh, it does? I figured it was one-pass, and that it only ever refers back > to types in previous CUs? So it doesn't have to go back and do a second > pass. But I guess if sees a declaration of T1 in CU1, then later on sees a > definition of T1 in CU2, does it somehow go back to CU1 and remove the > declaration/make references refer to the definition in CU2? I figured it'd > just leave the declaration and references to it as-is, then add the > definition and use that from CU2 onwards? > > For the processing of the types, it do not go back. > This "I figured it was one-pass, and that it only ever refers back to > types in previous CUs" > and this "I figured it'd just leave the declaration and references to it > as-is, then add the definition and use that from CU2 onwards" are correct. >Great - thanks for explaining/confirming!> > With multi-thread implementation such situation would arise more often >> for type references and so more offsets should be fixed during additional >> pass. >> >> DWARFLinker could create additional artificial compile unit and put all >>> merged types there. Later patch all type references to point into this >>> additional compilation unit. No any bits would be duplicated in that case. >>> The performance improvement could be achieved due to less amount of the >>> copied DWARF and due to the fact that type references could be updated when >>> DWARF is cloned(no need in additional pass for that). >>> >> "later patch all type references to point into this additional >> compilation unit" - that's the additional pass that people are probably >> talking/concerned about. Rewalking all the DWARF. The current dsymutil >> approach, as far as I know, is single pass - it knows the final, absolute >> offset to the type from the moment it emits that type/needs to refer to it. >> >> Right. Current dsymutil approach is single pass. And from that point of >> view, solution >> which you`ve described(to produce a declaration of the type, with the >> relevant members) >> allows to keep that single pass implementation. >> >> But there is a restriction for current dsymutil approach: To process >> inter-CU references >> it needs to load all DWARF into the memory(While it analyzes which part >> of DWARF is live, >> it needs to have all CUs loaded into the memory). >> > All DWARF for a single file (which for dsymutil is mostly a single CU, > except with LTO I guess?), not all DWARF for all inputs in memory at once, > yeah? > > right. In dsymutil case - all DWARF for a single file(not all DWARF for > all inputs in memory at once). > But in llvm-dwarfutil case single file contains DWARF for all original > input object files and it all becomes > loaded into memory. >Yeha, would be great to try to go CU-by-CU.> That leads to huge memory usage. >> It is less important when source is a set of object files(like in >> dsymutil case) and this >> become a real problem for llvm-dwarfutil utility when source is a single >> file(With current >> implementation it needs 30G of memory for compiling clang binary). >> > Yeah, that's where I think you'd need a fixup pass one way or another - > because cross-CU references can mean that when you figure out a new layout > for CU5 (because it has a duplicate type definition of something in CU1) > then you might have to touch CU4 that had an absolute/cross-CU forward > reference to CU5. Once you've got such a fixup pass (if dsymutil already > has one? Which, like I said, I'm confused why it would have one/that > doesn't match my very vague understanding) then I think you could make > dsymutil work on a per-CU basis streaming things out, then fixing up a few > offsets. > > When dsymutil deduplicates types it changes local CU reference into > inter-CU reference(so that CU2(next) could reference type definition from > CU1(prev)). To do this change it does not need to do any fixups currently. > > When dsymutil meets already existed(located in the input object file) > inter-CU reference pointing into the CU which has not been processed > yet(and then its offset is unknown) it marks it as "forward reference" and > patches later during additional pass "fixup forward references" at a time > when offsets are known. >OK, so limited 2 pass system. (does it do that second pass once at the end of the whole dsymutil run, or at the end of each input file? (so if an input file has two CUs and the first CU references a type in the second CU - it could write the first CU with a "forward reference", then write the second CU, then fixup the forward reference - and then go on to the next file and its CUs - this could improve performance by touching recently used memory/disk pages only, rather than going all the way back to the start later on when those pages have become cold)> > If CUs would be processed in parallel their offsets would not be known at > the moment when local type reference would be changed into inter-CU > reference. So we would need to do the same fix-up processing for all > references to the types like we already do for other inter-CU references. >Yeah - though the existence of this second "fixup forward references" system - yeah, could just use it much more generally as you say. Not an extra pass, just the existing second pass but having way more fixups to fixup in that pass.> > Without loading all CU into the memory it would require two passes >> solution. First to analyze >> which part of DWARF relates to live code and then second pass to generate >> the result. >> > Not sure it'd require any more second pass than a "fixup" pass, which it > sounds like you're saying it already has? > > It looks like it would need an additional pass to process inter-CU > references(existed in incoming file) if we do not want to load all CUs into > memory. >Usually inter-CU references aren't used, except in LTO - and in LTO all the DWARF deduplication and function discarding is already done by the IR linker anyway. (ThinLTO is a bit different, but really we'd be better off teaching it the extra tricks anyway (some can't be fixed in ThinLTO - like emitting a "Home" definition of an inline function, only to find out other ThinLTO backend/shards managed to optimize away all uses of the function... so some cleanup may be useful there)). It might be possible to do a more dynamic/rolling cache - keep only the CUs with unresolved cross-CU references alive and only keep them alive until their cross-CU references are found/marked alive. This should make things no worse than the traditional dsymutil case - since cross-CU references are only effective/generally used within a single object file (it's possible to create relocations for them into other files - but I know LLVM doesn't currently do this and I don't think GCC does it) with multiple CUs anyway - so at most you'd keep all the CUs from a single original input file alive together.> When the input file contains inter-CU references, DWARFLinker needs to > follow them while doing liveness marking. i.e. if the original CU has a > live part which references another CU we need to follow this new CU and > mark the referenced part as life. At the current moment, while doing > liveness analysis, we have all CUs in memory. That allows us to load all > CUs once and analyze them all. In case llvm-dwarfutil(which loads all DWARF > for input file) it leads to huge memory usage. > > Let's say CU1 references CU100. And CU100 references CU1. We could not > start generation for CU1 until we analyzed CU100 and marked the > corresponding part of CU1 as life. At the same time, we could not load > DWARF for all CUs. Then processing(in simplified form) could look like this: > > 1: for (CU : CU1...CU100) > load CU, do liveness analysis, remember references, unload CU > > 2: for (all references) > load CU, do liveness analysis, unload CU > > 3: for (CU : CU1...CU100) > load CU, clone CU > > That is a simplified scheme, but I think it is enough to show the idea. In > this scheme we have 1 and 2 which should be done before 3. > > > Alexey. > > If we would have a two passes solution then we could create a compilation >> unit with all >> types at first pass and at the second pass we could generate result with >> correct offsets(no >> need to fix up them as it is currently required by dsymutil for forward >> inter-CU references). >> The open question currently: how expensive this two passes approach is. >> >> Thank you, Alexey. >> >> Anyway, that might be the next step after multi-thread DWARFLinker would >>> be ready. >>> >> Yep, be interesting to see how it all goes! >> >>> >>> >>>> >>>> Do you suggest that 0x0000011b should be transformed into something >>>> like that: >>>> >>>> 0x000000fc: DW_TAG_compile_unit >>>> DW_AT_language (DW_LANG_C_plus_plus) >>>> DW_AT_name ("templ.cpp") >>>> DW_AT_stmt_list (0x00000090) >>>> DW_AT_low_pc (0x0000000100000fa0) >>>> DW_AT_high_pc (0x0000000100000fab) >>>> >>>> 0x0000011b: DW_TAG_structure_type >>>> DW_AT_specification (0x0000002a "x") >>>> >>>> 0x00000124: DW_TAG_subprogram >>>> DW_AT_linkage_name ("_ZN1x2f3IiEEiv") >>>> DW_AT_name ("f3<int>") >>>> DW_AT_type (0x000000000000005e "int") >>>> DW_AT_declaration (true) >>>> DW_AT_external (true) >>>> DW_AT_APPLE_optimized (true) >>>> 0x00000138: NULL >>>> 0x00000139: NULL >>>> >>>> 0x00000140: DW_TAG_subprogram >>>> DW_AT_low_pc (0x0000000100000fa0) >>>> DW_AT_high_pc (0x0000000100000fab) >>>> DW_AT_specification (0x0000000000000124 >>>> "_ZN1x2f3IiEEiv") >>>> 0x00000155: NULL >>>> >>>> Did I correctly get the idea? >>>> >>> >>> Yep, more or less. It'd be "safer" if 11b didn't use DW_AT_specification >>> to refer to 2a, but instead was only a completely independent declaration >>> of "x" - that path is already well supported/tested (well, it's the >>> work-in-progress stuff for lldb to support -fno-standalone-debug, but gdb's >>> been consuming DWARF like this for years, Clang and GCC both produce DWARF >>> like this (if the type is "homed" in another file, then Clang/GCC produce >>> DWARF that emits a declaration with just the members needed to define any >>> member functions defined/inlined/referenced in this CU)) for years. >>> >>> But using DW_AT_specification, or maybe some other extension attribute >>> might make the consumers task a bit easier (could do both - use an >>> extension attribute to tie them up, leave DW_AT_declaration/DW_AT_name here >>> for consumers that don't understand the extension attribute) in finding >>> that they're all the same type/pieces of teh same type. >>> >>> >>> yes. would try this solution. >>> >>> Thank you, Alexey. >>> >>>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20201027/af9b6705/attachment.html>
Alexey Lapshin via llvm-dev
2020-Oct-27 19:34 UTC
[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.
On 27.10.2020 20:32, David Blaikie wrote:> > > On Tue, Oct 27, 2020 at 1:23 AM Alexey Lapshin <avl.lapshin at gmail.com > <mailto:avl.lapshin at gmail.com>> wrote: > > > On 26.10.2020 22:38, David Blaikie wrote: >> >> >> On Sun, Oct 25, 2020 at 9:31 AM Alexey Lapshin >> <avl.lapshin at gmail.com <mailto:avl.lapshin at gmail.com>> wrote: >> >> >> On 23.10.2020 19:43, David Blaikie wrote: >>> >>>> >>>> >>>> >>>> Ah, yeah - that seems like a missed opportunity - >>>> duplicating the whole type DIE. LTO does this by making >>>> monolithic types - merging all the members from >>>> different definitions of the same type into one, but >>>> that's maybe too expensive for dsymutil (might still be >>>> interesting to know how much more expensive, etc). But >>>> I think the other way to go would be to produce a >>>> declaration of the type, with the relevant members - >>>> and let the DWARF consumer identify this declaration as >>>> matching up with the earlier definition. That's the >>>> sort of DWARF you get from the non-MachO default >>>> -fno-standalone-debug anyway, so it's already pretty >>>> well tested/supported (support in lldb's a bit >>>> younger/more work-in-progress, admittedly). I wonder >>>> how much dsym size there is that could be reduced by >>>> such an implementation. >>> >>> I see. Yes, that could be done and I think it would >>> result in noticeable size reduction(I do not know exact >>> numbers at the moment). >>> >>> I work on multi-thread DWARFLinker now and it`s first >>> version will do exactly the same type processing like >>> current dsymutil. >>> >>> Yeah, best to keep the behavior the same through that >>> >>> Above scheme could be implemented as a next step and it >>> would result in better size reduction(better than >>> current state). >>> >>> But I think the better scheme could be done also and it >>> would result in even bigger size reduction and in faster >>> execution. This scheme is something similar to what >>> you`ve described above: "LTO does - making monolithic >>> types - merging all the members from different >>> definitions of the same type into one". >>> >>> I believe the reason that's probably not been done is that >>> it can't be streamed - it'd lead to buffering more of the >>> output >> >> yes. The fact that DWARF should be streamed into AsmPrinter >> complicates parallel dwarf generation. In my prototype, I >> generate >> several resulting files(each for one source compilation unit) >> and then sequentially glue them into the final resulting file. >> >> How does that help? Do you use relocations in those intermediate >> object files so the DWARF in them can refer across files? > > It does not help with referring across the file. It helps to > parallel the generation of CU bodies. > It is not possible to write two CUs in parallel into AsmPrinter. > To make possible parallel generation I stream them into different > AsmPrinters(this comment is for "I believe the reason that's > probably not been done is that it can't be streamed". which > initially was about referring across the file, but it seems I > added another direction). > > Oh, I see - thanks for explaining, essentially buffering on-disk. > >> >>> (if two of these expandable types were in one CU - the start >>> of the second type couldn't be known until the end because >>> it might keep getting pushed later due to expansion of the >>> first type) and/or having to revisit all the type references >>> (the offset to the second type wouldn't be known until the >>> end - so writing the offsets to refer to the type would have >>> to be deferred until then). >> >> That is the second problem: offsets are not known until the >> end of file. >> dsymutil already has that situation for inter-CU references, >> so it has extra pass to >> fixup offsets. >> >> Oh, it does? I figured it was one-pass, and that it only ever >> refers back to types in previous CUs? So it doesn't have to go >> back and do a second pass. But I guess if sees a declaration of >> T1 in CU1, then later on sees a definition of T1 in CU2, does it >> somehow go back to CU1 and remove the declaration/make references >> refer to the definition in CU2? I figured it'd just leave the >> declaration and references to it as-is, then add the definition >> and use that from CU2 onwards? > > For the processing of the types, it do not go back. > This "I figured it was one-pass, and that it only ever refers back > to types in previous CUs" > and this "I figured it'd just leave the declaration and references > to it as-is, then add the definition and use that from CU2 > onwards" are correct. > > Great - thanks for explaining/confirming! > > >> With multi-thread implementation such situation would arise >> more often >> for type references and so more offsets should be fixed >> during additional pass. >> >>> DWARFLinker could create additional artificial compile >>> unit and put all merged types there. Later patch all >>> type references to point into this additional >>> compilation unit. No any bits would be duplicated in >>> that case. The performance improvement could be achieved >>> due to less amount of the copied DWARF and due to the >>> fact that type references could be updated when DWARF is >>> cloned(no need in additional pass for that). >>> >>> "later patch all type references to point into this >>> additional compilation unit" - that's the additional pass >>> that people are probably talking/concerned about. Rewalking >>> all the DWARF. The current dsymutil approach, as far as I >>> know, is single pass - it knows the final, absolute offset >>> to the type from the moment it emits that type/needs to >>> refer to it. >> >> Right. Current dsymutil approach is single pass. And from >> that point of view, solution >> which you`ve described(to produce a declaration of the type, >> with the relevant members) >> allows to keep that single pass implementation. >> >> But there is a restriction for current dsymutil approach: To >> process inter-CU references >> it needs to load all DWARF into the memory(While it analyzes >> which part of DWARF is live, >> it needs to have all CUs loaded into the memory). >> >> All DWARF for a single file (which for dsymutil is mostly a >> single CU, except with LTO I guess?), not all DWARF for all >> inputs in memory at once, yeah? > > right. In dsymutil case - all DWARF for a single file(not all > DWARF for all inputs in memory at once). > But in llvm-dwarfutil case single file contains DWARF for all > original input object files and it all becomes > loaded into memory. > > Yeha, would be great to try to go CU-by-CU. > >> That leads to huge memory usage. >> It is less important when source is a set of object >> files(like in dsymutil case) and this >> become a real problem for llvm-dwarfutil utility when source >> is a single file(With current >> implementation it needs 30G of memory for compiling clang >> binary). >> >> Yeah, that's where I think you'd need a fixup pass one way or >> another - because cross-CU references can mean that when you >> figure out a new layout for CU5 (because it has a duplicate type >> definition of something in CU1) then you might have to touch CU4 >> that had an absolute/cross-CU forward reference to CU5. Once >> you've got such a fixup pass (if dsymutil already has one? Which, >> like I said, I'm confused why it would have one/that doesn't >> match my very vague understanding) then I think you could make >> dsymutil work on a per-CU basis streaming things out, then fixing >> up a few offsets. > > When dsymutil deduplicates types it changes local CU reference > into inter-CU reference(so that CU2(next) could reference type > definition from CU1(prev)). To do this change it does not need to > do any fixups currently. > > When dsymutil meets already existed(located in the input object > file) inter-CU reference pointing into the CU which has not been > processed yet(and then its offset is unknown) it marks it as > "forward reference" and patches later during additional pass > "fixup forward references" at a time when offsets are known. > > OK, so limited 2 pass system. (does it do that second pass once at the > end of the whole dsymutil run, or at the end of each input file? (so > if an input file has two CUs and the first CU references a type in the > second CU - it could write the first CU with a "forward reference", > then write the second CU, then fixup the forward reference - and then > go on to the next file and its CUs - this could improve performance by > touching recently used memory/disk pages only, rather than going all > the way back to the start later on when those pages have become cold)yes, It does it in the end of each input file.> > If CUs would be processed in parallel their offsets would not be > known at the moment when local type reference would be changed > into inter-CU reference. So we would need to do the same fix-up > processing for all references to the types like we already do for > other inter-CU references. > > Yeah - though the existence of this second "fixup forward references" > system - yeah, could just use it much more generally as you say. Not > an extra pass, just the existing second pass but having way more > fixups to fixup in that pass.If we would be able to change the algorithm in such way : 1. analyse all CUs. 2. clone all CUs. Then we could create a merged type table(artificial CU containing types) during step1. If that type table would be written first, then all following CUs could use known offsets to the types and we would not need additional fix-up processing for type references. It would still be necessary to fix-up other inter-CU references. But it would not be necessary to fix-up type references (which constitute the vast majority).> >> Without loading all CU into the memory it would require two >> passes solution. First to analyze >> which part of DWARF relates to live code and then second pass >> to generate the result. >> >> Not sure it'd require any more second pass than a "fixup" pass, >> which it sounds like you're saying it already has? > > It looks like it would need an additional pass to process inter-CU > references(existed in incoming file) if we do not want to load all > CUs into memory. > > Usually inter-CU references aren't used, except in LTO - and in LTO > all the DWARF deduplication and function discarding is already done by > the IR linker anyway. (ThinLTO is a bit different, but really we'd be > better off teaching it the extra tricks anyway (some can't be fixed in > ThinLTO - like emitting a "Home" definition of an inline function, > only to find out other ThinLTO backend/shards managed to optimize away > all uses of the function... so some cleanup may be useful there)). It > might be possible to do a more dynamic/rolling cache - keep only the > CUs with unresolved cross-CU references alive and only keep them alive > until their cross-CU references are found/marked alive. This should > make things no worse than the traditional dsymutil case - since > cross-CU references are only effective/generally used within a single > object file (it's possible to create relocations for them into other > files - but I know LLVM doesn't currently do this and I don't think > GCC does it) with multiple CUs anyway - so at most you'd keep all the > CUs from a single original input file alive together.But, since it is a DWARF documented case the tool should be ready for such case(when inter-CU references are heavily used). Moreover, llvm-dwarfutil would be the tool producing exactly such situation. The resulting file(produced by llvm-dwarfutil) would contain a lot of inter-CU references. Probably, there is no practical reasons to apply llvm-dwarfutil to the same file twice but it would be a good test for the tool. Generally, I think we should not assume that inter-CU references would be used in a limited way. Anyway, if this scheme: 1. analyse all CUs. 2. clone all CUs. would work slow then we would need to continue with one-pass solution and not support complex closely coupled inputs. Alexey.> When the input file contains inter-CU references, DWARFLinker > needs to follow them while doing liveness marking. i.e. if the > original CU has a live part which references another CU we need to > follow this new CU and mark the referenced part as life. At the > current moment, while doing liveness analysis, we have all CUs in > memory. That allows us to load all CUs once and analyze them all. > In case llvm-dwarfutil(which loads all DWARF for input file) it > leads to huge memory usage. > > Let's say CU1 references CU100. And CU100 references CU1. We could > not start generation for CU1 until we analyzed CU100 and marked > the corresponding part of CU1 as life. At the same time, we could > not load DWARF for all CUs. Then processing(in simplified form) > could look like this: > > 1: for (CU : CU1...CU100) > load CU, do liveness analysis, remember references, unload CU > > 2: for (all references) > load CU, do liveness analysis, unload CU > > 3: for (CU : CU1...CU100) > load CU, clone CU > > That is a simplified scheme, but I think it is enough to show the > idea. In this scheme we have 1 and 2 which should be done before 3. > > > Alexey. > >> If we would have a two passes solution then we could create a >> compilation unit with all >> types at first pass and at the second pass we could generate >> result with correct offsets(no >> need to fix up them as it is currently required by dsymutil >> for forward inter-CU references). >> The open question currently: how expensive this two passes >> approach is. >> >> Thank you, Alexey. >> >>> Anyway, that might be the next step after multi-thread >>> DWARFLinker would be ready. >>> >>> Yep, be interesting to see how it all goes! >>> >>>> >>>> Do you suggest that 0x0000011b should be >>>> transformed into something like that: >>>> >>>> 0x000000fc: DW_TAG_compile_unit >>>> DW_AT_language (DW_LANG_C_plus_plus) >>>> DW_AT_name ("templ.cpp") >>>> DW_AT_stmt_list (0x00000090) >>>> DW_AT_low_pc (0x0000000100000fa0) >>>> DW_AT_high_pc (0x0000000100000fab) >>>> >>>> 0x0000011b: DW_TAG_structure_type >>>> DW_AT_specification (0x0000002a "x") >>>> >>>> 0x00000124: DW_TAG_subprogram >>>> DW_AT_linkage_name ("_ZN1x2f3IiEEiv") >>>> DW_AT_name ("f3<int>") >>>> DW_AT_type (0x000000000000005e "int") >>>> DW_AT_declaration (true) >>>> DW_AT_external (true) >>>> DW_AT_APPLE_optimized (true) >>>> 0x00000138: NULL >>>> 0x00000139: NULL >>>> >>>> 0x00000140: DW_TAG_subprogram >>>> DW_AT_low_pc (0x0000000100000fa0) >>>> DW_AT_high_pc (0x0000000100000fab) >>>> DW_AT_specification (0x0000000000000124 >>>> "_ZN1x2f3IiEEiv") >>>> 0x00000155: NULL >>>> >>>> Did I correctly get the idea? >>>> >>>> >>>> Yep, more or less. It'd be "safer" if 11b didn't use >>>> DW_AT_specification to refer to 2a, but instead was >>>> only a completely independent declaration of "x" - that >>>> path is already well supported/tested (well, it's the >>>> work-in-progress stuff for lldb to support >>>> -fno-standalone-debug, but gdb's been consuming DWARF >>>> like this for years, Clang and GCC both produce DWARF >>>> like this (if the type is "homed" in another file, then >>>> Clang/GCC produce DWARF that emits a declaration with >>>> just the members needed to define any member functions >>>> defined/inlined/referenced in this CU)) for years. >>>> >>>> But using DW_AT_specification, or maybe some other >>>> extension attribute might make the consumers task a bit >>>> easier (could do both - use an extension attribute to >>>> tie them up, leave DW_AT_declaration/DW_AT_name here >>>> for consumers that don't understand the extension >>>> attribute) in finding that they're all the same >>>> type/pieces of teh same type. >>> >>> yes. would try this solution. >>> >>> Thank you, Alexey. >>> >>>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20201027/1660d8cf/attachment.html>