thr3ads.net - llvm dev - [llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF. [Nov 2020]

If this information is useful, please help other people find it:
Share via:

Alexey Lapshin via llvm-dev

2020-Oct-28 13:01 UTC

[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

On 28.10.2020 01:49, David Blaikie wrote:>
>
> On Tue, Oct 27, 2020 at 12:34 PM Alexey Lapshin <avl.lapshin at
gmail.com
> <mailto:avl.lapshin at gmail.com>> wrote:
>
>
>     On 27.10.2020 20:32, David Blaikie wrote:
>>
>>
>>     On Tue, Oct 27, 2020 at 1:23 AM Alexey Lapshin
>>     <avl.lapshin at gmail.com <mailto:avl.lapshin at
gmail.com>> wrote:
>>
>>
>>         On 26.10.2020 22:38, David Blaikie wrote:
>>>
>>>
>>>         On Sun, Oct 25, 2020 at 9:31 AM Alexey Lapshin
>>>         <avl.lapshin at gmail.com <mailto:avl.lapshin at
gmail.com>> wrote:
>>>
>>>
>>>             On 23.10.2020 19:43, David Blaikie wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>>                 Ah, yeah - that seems like a missed
opportunity -
>>>>>                 duplicating the whole type DIE. LTO does
this by
>>>>>                 making monolithic types - merging all the
members
>>>>>                 from different definitions of the same type
into
>>>>>                 one, but that's maybe too expensive for
dsymutil
>>>>>                 (might still be interesting to know how
much more
>>>>>                 expensive, etc). But I think the other way
to go
>>>>>                 would be to produce a declaration of the
type,
>>>>>                 with the relevant members - and let the
DWARF
>>>>>                 consumer identify this declaration as
matching up
>>>>>                 with the earlier definition. That's the
sort of
>>>>>                 DWARF you get from the non-MachO default
>>>>>                 -fno-standalone-debug anyway, so it's
already
>>>>>                 pretty well tested/supported (support in
lldb's a
>>>>>                 bit younger/more work-in-progress,
admittedly). I
>>>>>                 wonder how much dsym size there is that
could be
>>>>>                 reduced by such an implementation.
>>>>
>>>>                 I see. Yes, that could be done and I think it
would
>>>>                 result in noticeable size reduction(I do not
know
>>>>                 exact numbers at the moment).
>>>>
>>>>                 I work on multi-thread DWARFLinker now and it`s
>>>>                 first version will do exactly the same type
>>>>                 processing like current dsymutil.
>>>>
>>>>             Yeah, best to keep the behavior the same through
that
>>>>
>>>>                 Above scheme could be implemented as a next
step
>>>>                 and it would result in better size
reduction(better
>>>>                 than current state).
>>>>
>>>>                 But I think the better scheme could be done
also
>>>>                 and it would result in even bigger size
reduction
>>>>                 and in faster execution. This scheme is
something
>>>>                 similar to what you`ve described above:
"LTO does -
>>>>                 making monolithic types - merging all the
members
>>>>                 from different definitions of the same type
into one".
>>>>
>>>>             I believe the reason that's probably not been
done is
>>>>             that it can't be streamed - it'd lead to
buffering more
>>>>             of the output
>>>
>>>             yes. The fact that DWARF should be streamed into
>>>             AsmPrinter complicates parallel dwarf generation. In my
>>>             prototype, I generate
>>>             several resulting files(each for one source compilation
>>>             unit) and then sequentially glue them into the final
>>>             resulting file.
>>>
>>>         How does that help? Do you use relocations in those
>>>         intermediate object files so the DWARF in them can refer
>>>         across files?
>>
>>         It does not help with referring across the file. It helps to
>>         parallel the generation of CU bodies.
>>         It is not possible to write two CUs in parallel into
>>         AsmPrinter. To make possible parallel generation I stream
>>         them into different AsmPrinters(this comment is for "I
>>         believe the reason that's probably not been done is that it
>>         can't be streamed". which initially was about
referring
>>         across the file, but it seems I added another direction).
>>
>>     Oh, I see - thanks for explaining, essentially buffering on-disk.
>>
>>>
>>>>             (if two of these expandable types were in one CU -
the
>>>>             start of the second type couldn't be known
until the
>>>>             end because it might keep getting pushed later due
to
>>>>             expansion of the first type) and/or having to
revisit
>>>>             all the type references (the offset to the second
type
>>>>             wouldn't be known until the end - so writing
the
>>>>             offsets to refer to the type would have to be
deferred
>>>>             until then).
>>>
>>>             That is the second problem: offsets are not known until
>>>             the end of file.
>>>             dsymutil already has that situation for inter-CU
>>>             references, so it has extra pass to
>>>             fixup offsets.
>>>
>>>         Oh, it does? I figured it was one-pass, and that it only
>>>         ever refers back to types in previous CUs? So it
doesn't
>>>         have to go back and do a second pass. But I guess if sees a
>>>         declaration of T1 in CU1, then later on sees a definition
of
>>>         T1 in CU2, does it somehow go back to CU1 and remove the
>>>         declaration/make references refer to the definition in CU2?
>>>         I figured it'd just leave the declaration and
references to
>>>         it as-is, then add the definition and use that from CU2
>>>         onwards?
>>
>>         For the processing of the types, it do not go back.
>>         This "I figured it was one-pass, and that it only ever
refers
>>         back to types in previous CUs"
>>         and this "I figured it'd just leave the declaration
and
>>         references to it as-is, then add the definition and use that
>>         from CU2 onwards" are correct.
>>
>>     Great - thanks for explaining/confirming!
>>
>>
>>>             With multi-thread implementation such situation would
>>>             arise more often
>>>             for type references and so more offsets should be fixed
>>>             during additional pass.
>>>
>>>>                 DWARFLinker could create additional artificial
>>>>                 compile unit and put all merged types there.
Later
>>>>                 patch all type references to point into this
>>>>                 additional compilation unit.  No any bits would
be
>>>>                 duplicated in that case. The performance
>>>>                 improvement could be achieved due to less
amount of
>>>>                 the copied DWARF and due to the fact that type
>>>>                 references could be updated when DWARF is
cloned(no
>>>>                 need in additional pass for that).
>>>>
>>>>             "later patch all type references to point into
this
>>>>             additional compilation unit" - that's the
additional
>>>>             pass that people are probably talking/concerned
about.
>>>>             Rewalking all the DWARF. The current dsymutil
approach,
>>>>             as far as I know, is single pass - it knows the
final,
>>>>             absolute offset to the type from the moment it
emits
>>>>             that type/needs to refer to it.
>>>
>>>             Right. Current dsymutil approach is single pass. And
>>>             from that point of view, solution
>>>             which you`ve described(to produce a declaration of the
>>>             type, with the relevant members)
>>>             allows to keep that single pass implementation.
>>>
>>>             But there is a restriction for current dsymutil
>>>             approach: To process inter-CU references
>>>             it needs to load all DWARF into the memory(While it
>>>             analyzes which part of DWARF is live,
>>>             it needs to have all CUs loaded into the memory).
>>>
>>>         All DWARF for a single file (which for dsymutil is mostly a
>>>         single CU, except with LTO I guess?), not all DWARF for all
>>>         inputs in memory at once, yeah?
>>
>>         right. In dsymutil case - all DWARF for a single file(not all
>>         DWARF for all inputs in memory at once).
>>         But in llvm-dwarfutil case single file contains DWARF for all
>>         original input object files and it all becomes
>>         loaded into memory.
>>
>>     Yeha, would be great to try to go CU-by-CU.
>>
>>>             That leads to huge memory usage.
>>>             It is less important when source is a set of object
>>>             files(like in dsymutil case) and this
>>>             become a real problem for llvm-dwarfutil utility when
>>>             source is a single file(With current
>>>             implementation it needs 30G of memory for compiling
>>>             clang binary).
>>>
>>>         Yeah, that's where I think you'd need a fixup pass
one way
>>>         or another - because cross-CU references can mean that when
>>>         you figure out a new layout for CU5 (because it has a
>>>         duplicate type definition of something in CU1) then you
>>>         might have to touch CU4 that had an absolute/cross-CU
>>>         forward reference to CU5. Once you've got such a fixup
pass
>>>         (if dsymutil already has one? Which, like I said, I'm
>>>         confused why it would have one/that doesn't match my
very
>>>         vague understanding) then I think you could make dsymutil
>>>         work on a per-CU basis streaming things out, then fixing up
>>>         a few offsets.
>>
>>         When dsymutil deduplicates types it changes local CU
>>         reference into inter-CU reference(so that CU2(next) could
>>         reference type definition from CU1(prev)). To do this change
>>         it does not need to do any fixups currently.
>>
>>         When dsymutil meets already existed(located in the input
>>         object file) inter-CU reference pointing into the CU which
>>         has not been processed yet(and then its offset is unknown) it
>>         marks it as "forward reference" and patches later
during
>>         additional pass "fixup forward references" at a time
when
>>         offsets are known.
>>
>>     OK, so limited 2 pass system. (does it do that second pass once
>>     at the end of the whole dsymutil run, or at the end of each input
>>     file? (so if an input file has two CUs and the first CU
>>     references a type in the second CU - it could write the first CU
>>     with a "forward reference", then write the second CU,
then fixup
>>     the forward reference - and then go on to the next file and its
>>     CUs - this could improve performance by touching recently used
>>     memory/disk pages only, rather than going all the way back to the
>>     start later on when those pages have become cold)
>
>     yes, It does it in the end of each input file.
>
>
>>
>>         If CUs would be processed in parallel their offsets would not
>>         be known at the moment when local type reference would be
>>         changed into inter-CU reference. So we would need to do the
>>         same fix-up processing for all references to the types like
>>         we already do for other inter-CU references.
>>
>>     Yeah - though the existence of this second "fixup forward
>>     references" system - yeah, could just use it much more
generally
>>     as you say. Not an extra pass, just the existing second pass but
>>     having way more fixups to fixup in that pass.
>     If we would be able to change the algorithm in such way :
>
>     1. analyse all CUs.
>     2. clone all CUs.
>
>     Then we could create a merged type table(artificial CU containing
>     types) during step1.
>     If that type table would be written first, then all following CUs
>     could use known offsets
>     to the types and we would not need additional fix-up processing
>     for type references.
>     It would still be necessary to fix-up other inter-CU references.
>     But it would not be necessary
>     to fix-up type references (which constitute the vast majority).
>
>
> To me, that sounds more expensive than the fixup forward references pass.
If we would speak about direct comparison then yes loading DWARF one 
more time looks more expensive than fixup forward references pass. But 
if we would speak about the general picture then it could probably be 
beneficial:

1. merging types would lead to a smaller size of resulting DWARF. This 
would speed up the process.
    f.e. If we would switch "odr types deduplication" off in current 
implementation then it would increase execution time two times. That is 
because more DWARF should be cloned and written in the result. 
Implementation of "merging types" would probably have a similar effect
    - It would speed-up the overall process. So from one side additional 
step for loading DWARF would
    decrease performance but a smaller amount of resulting data would 
increase performance.

2. When types would be put in the first CU then we would have a simple 
strategy for our liveness analysis algorithm: just always keep the first 
CU in memory. This allows us to speed up our liveness analysis step.

Anyway, all the above is just an idea for future work. Currently, I am 
going to implement multithread processing for CUs loaded into memory and 
having the same type of processing as it currently is(Which assumes that 
"fixup forward references pass" started to do more work by fixing
types
references).

>
>>
>>>             Without loading all CU into the memory it would require
>>>             two passes solution. First to analyze
>>>             which part of DWARF relates to live code and then
second
>>>             pass to generate the result.
>>>
>>>         Not sure it'd require any more second pass than a
"fixup"
>>>         pass, which it sounds like you're saying it already
has?
>>
>>         It looks like it would need an additional pass to process
>>         inter-CU references(existed in incoming file) if we do not
>>         want to load all CUs into memory.
>>
>>     Usually inter-CU references aren't used, except in LTO - and in
>>     LTO all the DWARF deduplication and function discarding is
>>     already done by the IR linker anyway. (ThinLTO is a bit
>>     different, but really we'd be better off teaching it the extra
>>     tricks anyway (some can't be fixed in ThinLTO - like emitting a
>>     "Home" definition of an inline function, only to find out
other
>>     ThinLTO backend/shards managed to optimize away all uses of the
>>     function... so some cleanup may be useful there)). It might be
>>     possible to do a more dynamic/rolling cache - keep only the CUs
>>     with unresolved cross-CU references alive and only keep them
>>     alive until their cross-CU references are found/marked alive.
>>     This should make things no worse than the traditional dsymutil
>>     case - since cross-CU references are only effective/generally
>>     used within a single object file (it's possible to create
>>     relocations for them into other files - but I know LLVM doesn't
>>     currently do this and I don't think GCC does it) with multiple
>>     CUs anyway - so at most you'd keep all the CUs from a single
>>     original input file alive together.
>     But, since it is a DWARF documented case the tool should be ready
>     for such case(when inter-CU
>     references are heavily used).
>
>
> Sure - but by implementing a CU liveness window like that (keeping CUs 
> live only so long as they need to be rather than an all-or-nothing 
> approach) only especially quirky inputs would hit the worst case while 
> the more normal inputs could perform better.
It is not clear what should be put in such CU liveness window. If CU100 
references CU1 - how could we know that we need to put CU1 into CU 
liveness window before we processed CU100?

>     Moreover, llvm-dwarfutil would be the tool producing
>     exactly such situation. The resulting file(produced by
>     llvm-dwarfutil) would contain a lot of
>     inter-CU references. Probably, there is no practical reasons to
>     apply llvm-dwarfutil to the same
>     file twice but it would be a good test for the tool.
>
>
> It'd be a good stress test, but not necessarily something that would 
> need to perform the best because it wouldn't be a common use case.
I agree that we should not slow down the DWARFLinker in common cases 
only because we need to support the worst cases.
But we also need to implement a solution which works in some acceptable 
manner for the worst case. The current solution - loading everything in 
memory - makes it hard to use in a non-dsymutil scenario(llvm-dwarfutil).

There could be several things which could be used to decide whether we 
need to go on a light or heavy path:

1. If the input contains only a single CU we do not need to unload it 
from memory. Thus - we would not need to do an extra DWARF loading pass.
2. If abbreviations from the whole input file do not contain inter-CU 
references then while doing liveness analysis,  we do not need to wait 
until other CUs are processed.

Then that scheme would be used for worst cases:

1. for (CU : CU1...CU100) {
      load CU.
      analyse CU.
      unload CU.
    }
2. for (CU : CU1...CU100) {
      load CU.
      clone CU.
      unload CU.
    }
3. fixup forward references.

and that scheme for light cases:

1. for (CU : CU1...CU100) {
      load CU.
      analyse CU.
      clone CU.
      unload CU.
    }
2. fixup forward references.
>     Generally, I think we should not assume that inter-CU references
>     would be used in a limited way.
>
>     Anyway, if this scheme:
>
>     1. analyse all CUs.
>     2. clone all CUs.
>
>     would work slow then we would need to continue with one-pass
>     solution and not support complex closely coupled inputs.
>
>
> yeah, certainly seeing the data/experiments will be interesting, if 
> you end up implementing some different strategies, etc.
>
> I guess one possibility for parallel generation could be something 
> more like Microsoft's approach with a central debug info server that 
> compilers communicate with - not that exact model, I mean, but if 
> you've got parallel threads generating reduced DWARF into separate 
> object files - they could communicate with a single thread responsible 
> for type emission - the type emitter would be given types from the 
> separate threads and compute their size, queue them up to be streamed 
> out to the type CU (& keep the source CU alive until that work was 
> done) - such a central type emitter could quickly determine the size 
> of the type to be emitted and compute future type offsets (eg: if 5 
> types were in the queue, it could've figured out the offset of those 
> types already) to answer type offset queries quickly and unblock the 
> parallel threads to continue emitting their CUs containing type 
> references.
yes. Thank you. Would think about it.

Alexey.
>
> - Dave-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20201028/f8ef4a2d/attachment.html>

David Blaikie via llvm-dev

2020-Oct-28 17:38 UTC

head link

[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

On Wed, Oct 28, 2020 at 6:01 AM Alexey Lapshin <avl.lapshin at gmail.com>
wrote:
>
> On 28.10.2020 01:49, David Blaikie wrote:
>
>
>
> On Tue, Oct 27, 2020 at 12:34 PM Alexey Lapshin <avl.lapshin at
gmail.com>
> wrote:
>
>>
>> On 27.10.2020 20:32, David Blaikie wrote:
>>
>>
>>
>> On Tue, Oct 27, 2020 at 1:23 AM Alexey Lapshin <avl.lapshin at
gmail.com>
>> wrote:
>>
>>>
>>> On 26.10.2020 22:38, David Blaikie wrote:
>>>
>>>
>>>
>>> On Sun, Oct 25, 2020 at 9:31 AM Alexey Lapshin <avl.lapshin at
gmail.com>
>>> wrote:
>>>
>>>>
>>>> On 23.10.2020 19:43, David Blaikie wrote:
>>>>
>>>>
>>>>>>
>>>>>>
>>>>> Ah, yeah - that seems like a missed opportunity -
duplicating the
>>>>> whole type DIE. LTO does this by making monolithic types -
merging all the
>>>>> members from different definitions of the same type into
one, but that's
>>>>> maybe too expensive for dsymutil (might still be
interesting to know how
>>>>> much more expensive, etc). But I think the other way to go
would be to
>>>>> produce a declaration of the type, with the relevant
members - and let the
>>>>> DWARF consumer identify this declaration as matching up
with the earlier
>>>>> definition. That's the sort of DWARF you get from the
non-MachO default
>>>>> -fno-standalone-debug anyway, so it's already pretty
well tested/supported
>>>>> (support in lldb's a bit younger/more work-in-progress,
admittedly). I
>>>>> wonder how much dsym size there is that could be reduced by
such an
>>>>> implementation.
>>>>>
>>>>> I see. Yes, that could be done and I think it would result
in
>>>>> noticeable size reduction(I do not know exact numbers at
the moment).
>>>>>
>>>>> I work on multi-thread DWARFLinker now and it`s first
version will do
>>>>> exactly the same type processing like current dsymutil.
>>>>>
>>>> Yeah, best to keep the behavior the same through that
>>>>
>>>>> Above scheme could be implemented as a next step and it
would result
>>>>> in better size reduction(better than current state).
>>>>>
>>>>> But I think the better scheme could be done also and it
would result
>>>>> in even bigger size reduction and in faster execution. This
scheme is
>>>>> something similar to what you`ve described above: "LTO
does - making
>>>>> monolithic types - merging all the members from different
definitions of
>>>>> the same type into one".
>>>>>
>>>> I believe the reason that's probably not been done is that
it can't be
>>>> streamed - it'd lead to buffering more of the output
>>>>
>>>> yes. The fact that DWARF should be streamed into AsmPrinter
complicates
>>>> parallel dwarf generation. In my prototype, I generate
>>>> several resulting files(each for one source compilation unit)
and then
>>>> sequentially glue them into the final resulting file.
>>>>
>>> How does that help? Do you use relocations in those intermediate
object
>>> files so the DWARF in them can refer across files?
>>>
>>> It does not help with referring across the file. It helps to
parallel
>>> the generation of CU bodies.
>>> It is not possible to write two CUs in parallel into AsmPrinter. To
make
>>> possible parallel generation I stream them into different
AsmPrinters(this
>>> comment is for "I believe the reason that's probably not
been done is that
>>> it can't be streamed". which initially was about referring
across the file,
>>> but it seems I added another direction).
>>>
>> Oh, I see - thanks for explaining, essentially buffering on-disk.
>>
>>>
>>>> (if two of these expandable types were in one CU - the start of
the
>>>> second type couldn't be known until the end because it
might keep getting
>>>> pushed later due to expansion of the first type) and/or having
to revisit
>>>> all the type references (the offset to the second type
wouldn't be known
>>>> until the end - so writing the offsets to refer to the type
would have to
>>>> be deferred until then).
>>>>
>>>> That is the second problem: offsets are not known until the end
of file.
>>>> dsymutil already has that situation for inter-CU references, so
it has
>>>> extra pass to
>>>> fixup offsets.
>>>>
>>> Oh, it does? I figured it was one-pass, and that it only ever
refers
>>> back to types in previous CUs? So it doesn't have to go back
and do a
>>> second pass. But I guess if sees a declaration of T1 in CU1, then
later on
>>> sees a definition of T1 in CU2, does it somehow go back to CU1 and
remove
>>> the declaration/make references refer to the definition in CU2? I
figured
>>> it'd just leave the declaration and references to it as-is,
then add the
>>> definition and use that from CU2 onwards?
>>>
>>> For the processing of the types, it do not go back.
>>> This "I figured it was one-pass, and that it only ever refers
back to
>>> types in previous CUs"
>>> and this "I figured it'd just leave the declaration and
references to it
>>> as-is, then add the definition and use that from CU2 onwards"
are correct.
>>>
>> Great - thanks for explaining/confirming!
>>
>>>
>>> With multi-thread implementation such situation would arise more
often
>>>> for type references and so more offsets should be fixed during
>>>> additional pass.
>>>>
>>>> DWARFLinker could create additional artificial compile unit and
put all
>>>>> merged types there. Later patch all type references to
point into this
>>>>> additional compilation unit.  No any bits would be
duplicated in that case.
>>>>> The performance improvement could be achieved due to less
amount of the
>>>>> copied DWARF and due to the fact that type references could
be updated when
>>>>> DWARF is cloned(no need in additional pass for that).
>>>>>
>>>> "later patch all type references to point into this
additional
>>>> compilation unit" - that's the additional pass that
people are probably
>>>> talking/concerned about. Rewalking all the DWARF. The current
dsymutil
>>>> approach, as far as I know, is single pass - it knows the
final, absolute
>>>> offset to the type from the moment it emits that type/needs to
refer to it.
>>>>
>>>> Right. Current dsymutil approach is single pass. And from that
point of
>>>> view, solution
>>>> which you`ve described(to produce a declaration of the type,
with the
>>>> relevant members)
>>>> allows to keep that single pass implementation.
>>>>
>>>> But there is a restriction for current dsymutil approach: To
process
>>>> inter-CU references
>>>> it needs to load all DWARF into the memory(While it analyzes
which part
>>>> of DWARF is live,
>>>> it needs to have all CUs loaded into the memory).
>>>>
>>> All DWARF for a single file (which for dsymutil is mostly a single
CU,
>>> except with LTO I guess?), not all DWARF for all inputs in memory
at once,
>>> yeah?
>>>
>>> right. In dsymutil case - all DWARF for a single file(not all DWARF
for
>>> all inputs in memory at once).
>>> But in llvm-dwarfutil case single file contains DWARF for all
original
>>> input object files and it all becomes
>>> loaded into memory.
>>>
>> Yeha, would be great to try to go CU-by-CU.
>>
>>> That leads to huge memory usage.
>>>> It is less important when source is a set of object files(like
in
>>>> dsymutil case) and this
>>>> become a real problem for llvm-dwarfutil utility when source is
a
>>>> single file(With current
>>>> implementation it needs 30G of memory for compiling clang
binary).
>>>>
>>> Yeah, that's where I think you'd need a fixup pass one way
or another -
>>> because cross-CU references can mean that when you figure out a new
layout
>>> for CU5 (because it has a duplicate type definition of something in
CU1)
>>> then you might have to touch CU4 that had an absolute/cross-CU
forward
>>> reference to CU5. Once you've got such a fixup pass (if
dsymutil already
>>> has one? Which, like I said, I'm confused why it would have
one/that
>>> doesn't match my very vague understanding) then I think you
could make
>>> dsymutil work on a per-CU basis streaming things out, then fixing
up a few
>>> offsets.
>>>
>>> When dsymutil deduplicates types it changes local CU reference into
>>> inter-CU reference(so that CU2(next) could reference type
definition from
>>> CU1(prev)). To do this change it does not need to do any fixups
currently.
>>>
>>> When dsymutil meets already existed(located in the input object
file)
>>> inter-CU reference pointing into the CU which has not been
processed
>>> yet(and then its offset is unknown) it marks it as "forward
reference" and
>>> patches later during additional pass "fixup forward
references" at a time
>>> when offsets are known.
>>>
>> OK, so limited 2 pass system. (does it do that second pass once at the
>> end of the whole dsymutil run, or at the end of each input file? (so if
an
>> input file has two CUs and the first CU references a type in the second
CU
>> - it could write the first CU with a "forward reference",
then write the
>> second CU, then fixup the forward reference - and then go on to the
next
>> file and its CUs - this could improve performance by touching recently
used
>> memory/disk pages only, rather than going all the way back to the start
>> later on when those pages have become cold)
>>
>> yes, It does it in the end of each input file.
>>
>>
>>
>>> If CUs would be processed in parallel their offsets would not be
known
>>> at the moment when local type reference would be changed into
inter-CU
>>> reference. So we would need to do the same fix-up processing for
all
>>> references to the types like we already do for other inter-CU
references.
>>>
>> Yeah - though the existence of this second "fixup forward
references"
>> system - yeah, could just use it much more generally as you say. Not an
>> extra pass, just the existing second pass but having way more fixups to
>> fixup in that pass.
>>
>> If we would be able to change the algorithm in such way :
>>
>> 1. analyse all CUs.
>> 2. clone all CUs.
>>
>> Then we could create a merged type table(artificial CU containing
types)
>> during step1.
>> If that type table would be written first, then all following CUs could
>> use known offsets
>> to the types and we would not need additional fix-up processing for
type
>> references.
>> It would still be necessary to fix-up other inter-CU references. But it
>> would not be necessary
>> to fix-up type references (which constitute the vast majority).
>>
>
> To me, that sounds more expensive than the fixup forward references pass.
>
> If we would speak about direct comparison then yes loading DWARF one more
> time looks more expensive than fixup forward references pass. But if we
> would speak about the general picture then it could probably be beneficial:
>
> 1. merging types would lead to a smaller size of resulting DWARF. This
> would speed up the process.
>    f.e. If we would switch "odr types deduplication" off in
current
> implementation then it would increase execution time two times. That is
> because more DWARF should be cloned and written in the result.
> Implementation of "merging types" would probably have a similar
effect
>    - It would speed-up the overall process. So from one side additional
> step for loading DWARF would
>    decrease performance but a smaller amount of resulting data would
> increase performance.
>
> 2. When types would be put in the first CU then we would have a simple
> strategy for our liveness analysis algorithm: just always keep the first CU
> in memory. This allows us to speed up our liveness analysis step.
>
> Anyway, all the above is just an idea for future work. Currently, I am
> going to implement multithread processing for CUs loaded into memory and
> having the same type of processing as it currently is(Which assumes that
> "fixup forward references pass" started to do more work by fixing
types
> references).
>
>
>
>
>>
>>
>>> Without loading all CU into the memory it would require two passes
>>>> solution. First to analyze
>>>> which part of DWARF relates to live code and then second pass
to
>>>> generate the result.
>>>>
>>> Not sure it'd require any more second pass than a
"fixup" pass, which it
>>> sounds like you're saying it already has?
>>>
>>> It looks like it would need an additional pass to process inter-CU
>>> references(existed in incoming file) if we do not want to load all
CUs into
>>> memory.
>>>
>> Usually inter-CU references aren't used, except in LTO - and in LTO
all
>> the DWARF deduplication and function discarding is already done by the
IR
>> linker anyway. (ThinLTO is a bit different, but really we'd be
better off
>> teaching it the extra tricks anyway (some can't be fixed in ThinLTO
- like
>> emitting a "Home" definition of an inline function, only to
find out other
>> ThinLTO backend/shards managed to optimize away all uses of the
function...
>> so some cleanup may be useful there)). It might be possible to do a
more
>> dynamic/rolling cache - keep only the CUs with unresolved cross-CU
>> references alive and only keep them alive until their cross-CU
references
>> are found/marked alive. This should make things no worse than the
>> traditional dsymutil case - since cross-CU references are only
>> effective/generally used within a single object file (it's possible
to
>> create relocations for them into other files - but I know LLVM
doesn't
>> currently do this and I don't think GCC does it) with multiple CUs
anyway -
>> so at most you'd keep all the CUs from a single original input file
alive
>> together.
>>
>> But, since it is a DWARF documented case the tool should be ready for
>> such case(when inter-CU
>> references are heavily used).
>>
>
> Sure - but by implementing a CU liveness window like that (keeping CUs
> live only so long as they need to be rather than an all-or-nothing
> approach) only especially quirky inputs would hit the worst case while the
> more normal inputs could perform better.
>
> It is not clear what should be put in such CU liveness window. If CU100
> references CU1 - how could we know that we need to put CU1 into CU liveness
> window before we processed CU100?
>Fair point, not just forward references to worry about but backward
references too. I wonder how much savings there is in the liveness analysis
compared to "keep one copy of everything, no matter whether it's live
or
not", then it can be a pure forward progress situation. (with the quirk
that you might emit a declaration for an entity once, then a definition for
it later - alternatively if a declaration is seen it could be skipped under
the assumption that a definition will follow (& use a forward ref fixup) -
and if none is found, splat some stub declarations into a trailing CU at
the end)
>
>
>
>> Moreover, llvm-dwarfutil would be the tool producing
>> exactly such situation. The resulting file(produced by llvm-dwarfutil)
>> would contain a lot of
>> inter-CU references. Probably, there is no practical reasons to apply
>> llvm-dwarfutil to the same
>> file twice but it would be a good test for the tool.
>>
>
> It'd be a good stress test, but not necessarily something that would
need
> to perform the best because it wouldn't be a common use case.
>
> I agree that we should not slow down the DWARFLinker in common cases only
> because we need to support the worst cases.
> But we also need to implement a solution which works in some acceptable
> manner for the worst case.
>I think that depends on "acceptable" - correct, yes. Practical to run
in
reasonable time/memory? Not necessarily, in my opinion.
> The current solution - loading everything in memory - makes it hard to use
> in a non-dsymutil scenario(llvm-dwarfutil).
>I agree it's worth exploring the non-dsymutil scenario, as you are - I'm
just saying we don't necessarily need to support high usability (fast/low
memory usage/etc) llvm-dwarfutil on an already dwarfutil'd binary (but as
you've pointed out, the "window" is unknowable because of backward
references, so this whole subthread is perhaps irrelevant).
> There could be several things which could be used to decide whether we
> need to go on a light or heavy path:
>
> 1. If the input contains only a single CU we do not need to unload it from
> memory. Thus - we would not need to do an extra DWARF loading pass.
> 2. If abbreviations from the whole input file do not contain inter-CU
> references then while doing liveness analysis,  we do not need to wait
> until other CUs are processed.
>(2) Yeah, that /may/ be a good idea, cheap to test, etc. Though I'd still
wonder if a more general implementation strategy could be found that would
make it easier to get a sliding scale of efficiency depending on how much
inter-CU references where were, not a "if there are none it's good, if
there are any it's bad or otherwise very different to implement".
>
> Then that scheme would be used for worst cases:
>
> 1. for (CU : CU1...CU100) {
>      load CU.
>      analyse CU.
>      unload CU.
>    }
> 2. for (CU : CU1...CU100) {
>      load CU.
>      clone CU.
>      unload CU.
>    }
> 3. fixup forward references.
>
> and that scheme for light cases:
>
> 1. for (CU : CU1...CU100) {
>      load CU.
>      analyse CU.
>      clone CU.
>      unload CU.
>    }
> 2. fixup forward references.
>
>
>
>> Generally, I think we should not assume that inter-CU references would
be
>> used in a limited way.
>>
>> Anyway, if this scheme:
>>
>> 1. analyse all CUs.
>> 2. clone all CUs.
>>
>> would work slow then we would need to continue with one-pass solution
and
>> not support complex closely coupled inputs.
>>
>
> yeah, certainly seeing the data/experiments will be interesting, if you
> end up implementing some different strategies, etc.
>
> I guess one possibility for parallel generation could be something more
> like Microsoft's approach with a central debug info server that
compilers
> communicate with - not that exact model, I mean, but if you've got
parallel
> threads generating reduced DWARF into separate object files - they could
> communicate with a single thread responsible for type emission - the type
> emitter would be given types from the separate threads and compute their
> size, queue them up to be streamed out to the type CU (& keep the
source CU
> alive until that work was done) - such a central type emitter could quickly
> determine the size of the type to be emitted and compute future type
> offsets (eg: if 5 types were in the queue, it could've figured out the
> offset of those types already) to answer type offset queries quickly and
> unblock the parallel threads to continue emitting their CUs containing type
> references.
>
> yes. Thank you. Would think about it.
>
> Alexey.
>
>
> - Dave
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20201028/c400d7e3/attachment.html>

Alexey Lapshin via llvm-dev

2020-Oct-29 07:50 UTC

head link

[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

On 28.10.2020 20:38, David Blaikie wrote:>
>
> On Wed, Oct 28, 2020 at 6:01 AM Alexey Lapshin <avl.lapshin at gmail.com
> <mailto:avl.lapshin at gmail.com>> wrote:
>
>
>     On 28.10.2020 01:49, David Blaikie wrote:
>>
>>
>>>
>>>>                 Without loading all CU into the memory it would
>>>>                 require two passes solution. First to analyze
>>>>                 which part of DWARF relates to live code and
then
>>>>                 second pass to generate the result.
>>>>
>>>>             Not sure it'd require any more second pass than
a
>>>>             "fixup" pass, which it sounds like
you're saying it
>>>>             already has?
>>>
>>>             It looks like it would need an additional pass to
>>>             process inter-CU references(existed in incoming file)
if
>>>             we do not want to load all CUs into memory.
>>>
>>>         Usually inter-CU references aren't used, except in LTO
- and
>>>         in LTO all the DWARF deduplication and function discarding
>>>         is already done by the IR linker anyway. (ThinLTO is a bit
>>>         different, but really we'd be better off teaching it
the
>>>         extra tricks anyway (some can't be fixed in ThinLTO -
like
>>>         emitting a "Home" definition of an inline
function, only to
>>>         find out other ThinLTO backend/shards managed to optimize
>>>         away all uses of the function... so some cleanup may be
>>>         useful there)). It might be possible to do a more
>>>         dynamic/rolling cache - keep only the CUs with unresolved
>>>         cross-CU references alive and only keep them alive until
>>>         their cross-CU references are found/marked alive. This
>>>         should make things no worse than the traditional dsymutil
>>>         case - since cross-CU references are only
>>>         effective/generally used within a single object file
(it's
>>>         possible to create relocations for them into other files -
>>>         but I know LLVM doesn't currently do this and I
don't think
>>>         GCC does it) with multiple CUs anyway - so at most
you'd
>>>         keep all the CUs from a single original input file alive
>>>         together.
>>         But, since it is a DWARF documented case the tool should be
>>         ready for such case(when inter-CU
>>         references are heavily used).
>>
>>
>>     Sure - but by implementing a CU liveness window like that
>>     (keeping CUs live only so long as they need to be rather than an
>>     all-or-nothing approach) only especially quirky inputs would hit
>>     the worst case while the more normal inputs could perform better.
>
>     It is not clear what should be put in such CU liveness window. If
>     CU100 references CU1 - how could we know that we need to put CU1
>     into CU liveness window before we processed CU100?
>
> Fair point, not just forward references to worry about but backward 
> references too. I wonder how much savings there is in the liveness 
> analysis compared to "keep one copy of everything, no matter whether 
> it's live or not", then it can be a pure forward progress
situation.
> (with the quirk that you might emit a declaration for an entity once, 
> then a definition for it later - alternatively if a declaration is 
> seen it could be skipped under the assumption that a definition will 
> follow (& use a forward ref fixup) - and if none is found, splat some 
> stub declarations into a trailing CU at the end)That should probably be measured, but I think we would loss most of size 
reduction
(since we would start keep unreferenced data which is currently removed).
Which would lead to slowdown performance and bigger disk space usage.

>
>>         Moreover, llvm-dwarfutil would be the tool producing
>>         exactly such situation. The resulting file(produced by
>>         llvm-dwarfutil) would contain a lot of
>>         inter-CU references. Probably, there is no practical reasons
>>         to apply llvm-dwarfutil to the same
>>         file twice but it would be a good test for the tool.
>>
>>
>>     It'd be a good stress test, but not necessarily something that
>>     would need to perform the best because it wouldn't be a common
>>     use case.
>
>     I agree that we should not slow down the DWARFLinker in common
>     cases only because we need to support the worst cases.
>     But we also need to implement a solution which works in some
>     acceptable manner for the worst case.
>
> I think that depends on "acceptable" - correct, yes. Practical to
run
> in reasonable time/memory? Not necessarily, in my opinion.
>
>     The current solution - loading everything in memory - makes it
>     hard to use in a non-dsymutil scenario(llvm-dwarfutil).
>
> I agree it's worth exploring the non-dsymutil scenario, as you are - 
> I'm just saying we don't necessarily need to support high usability
> (fast/low memory usage/etc) llvm-dwarfutil on an already dwarfutil'd 
> binary (but as you've pointed out, the "window" is unknowable
because
> of backward references, so this whole subthread is perhaps irrelevant).
>
>     There could be several things which could be used to decide
>     whether we need to go on a light or heavy path:
>
>     1. If the input contains only a single CU we do not need to unload
>     it from memory. Thus - we would not need to do an extra DWARF
>     loading pass.
>     2. If abbreviations from the whole input file do not contain
>     inter-CU references then while doing liveness analysis,  we do not
>     need to wait until other CUs are processed.
>
> (2) Yeah, that /may/ be a good idea, cheap to test, etc. Though I'd 
> still wonder if a more general implementation strategy could be found 
> that would make it easier to get a sliding scale of efficiency 
> depending on how much inter-CU references where were, not a "if there 
> are none it's good, if there are any it's bad or otherwise very 
> different to implement".At the current point, I do not see how that could be done.
One possibility is preliminary mark CU by IsReferenced flag.
Then we could delay cloning for such CU(either by putting into
CU liveness window/either by unloading).
Not referenced CU could be cloned immediately. Such a solution would be
more scalable and work well in cases when only a few inter-CU references
exist. Though it requires changes in DWARF format.

>
>     Then that scheme would be used for worst cases:
>
>     1. for (CU : CU1...CU100) {
>          load CU.
>          analyse CU.
>          unload CU.
>        }
>     2. for (CU : CU1...CU100) {
>          load CU.
>          clone CU.
>          unload CU.
>        }
>     3. fixup forward references.
>
>     and that scheme for light cases:
>
>     1. for (CU : CU1...CU100) {
>          load CU.
>          analyse CU.
>          clone CU.
>          unload CU.
>        }
>     2. fixup forward references.
>
>>         Generally, I think we should not assume that inter-CU
>>         references would be used in a limited way.
>>
>>         Anyway, if this scheme:
>>
>>         1. analyse all CUs.
>>         2. clone all CUs.
>>
>>         would work slow then we would need to continue with one-pass
>>         solution and not support complex closely coupled inputs.
>>
>>
>>     yeah, certainly seeing the data/experiments will be interesting,
>>     if you end up implementing some different strategies, etc.
>>
>>     I guess one possibility for parallel generation could be
>>     something more like Microsoft's approach with a central debug
>>     info server that compilers communicate with - not that exact
>>     model, I mean, but if you've got parallel threads generating
>>     reduced DWARF into separate object files - they could communicate
>>     with a single thread responsible for type emission - the type
>>     emitter would be given types from the separate threads and
>>     compute their size, queue them up to be streamed out to the type
>>     CU (& keep the source CU alive until that work was done) - such
a
>>     central type emitter could quickly determine the size of the type
>>     to be emitted and compute future type offsets (eg: if 5 types
>>     were in the queue, it could've figured out the offset of those
>>     types already) to answer type offset queries quickly and unblock
>>     the parallel threads to continue emitting their CUs containing
>>     type references.
>
>     yes. Thank you. Would think about it.
>
>     Alexey.
>
>>
>>     - Dave
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20201029/654145ca/attachment-0001.html>

Alexey Lapshin via llvm-dev

2020-Nov-01 22:05 UTC

head link

[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

On 28.10.2020 20:38, David Blaikie wrote:>
>
> On Wed, Oct 28, 2020 at 6:01 AM Alexey Lapshin <avl.lapshin at gmail.com
> <mailto:avl.lapshin at gmail.com>> wrote:
>
>
>     On 28.10.2020 01:49, David Blaikie wrote:
>>
>>
>>     On Tue, Oct 27, 2020 at 12:34 PM Alexey Lapshin
>>     <avl.lapshin at gmail.com <mailto:avl.lapshin at
gmail.com>> wrote:
>>
>>
>>         On 27.10.2020 20:32, David Blaikie wrote:
>>>
>>>
>>>         On Tue, Oct 27, 2020 at 1:23 AM Alexey Lapshin
>>>         <avl.lapshin at gmail.com <mailto:avl.lapshin at
gmail.com>> wrote:
>>>
>>>
>>>             On 26.10.2020 22:38, David Blaikie wrote:
>>>>
>>>>
>>>>             On Sun, Oct 25, 2020 at 9:31 AM Alexey Lapshin
>>>>             <avl.lapshin at gmail.com <mailto:avl.lapshin
at gmail.com>>
>>>>             wrote:
>>>>
>>>>
>>>>                 On 23.10.2020 19:43, David Blaikie wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>                     Ah, yeah - that seems like a missed
>>>>>>                     opportunity - duplicating the whole
type DIE.
>>>>>>                     LTO does this by making monolithic
types -
>>>>>>                     merging all the members from
different
>>>>>>                     definitions of the same type into
one, but
>>>>>>                     that's maybe too expensive for
dsymutil
>>>>>>                     (might still be interesting to know
how much
>>>>>>                     more expensive, etc). But I think
the other
>>>>>>                     way to go would be to produce a
declaration
>>>>>>                     of the type, with the relevant
members - and
>>>>>>                     let the DWARF consumer identify
this
>>>>>>                     declaration as matching up with the
earlier
>>>>>>                     definition. That's the sort of
DWARF you get
>>>>>>                     from the non-MachO default
>>>>>>                     -fno-standalone-debug anyway, so
it's already
>>>>>>                     pretty well tested/supported
(support in
>>>>>>                     lldb's a bit younger/more
work-in-progress,
>>>>>>                     admittedly). I wonder how much dsym
size
>>>>>>                     there is that could be reduced by
such an
>>>>>>                     implementation.
>>>>>
>>>>>                     I see. Yes, that could be done and I
think it
>>>>>                     would result in noticeable size
reduction(I do
>>>>>                     not know exact numbers at the moment).
>>>>>
>>>>>                     I work on multi-thread DWARFLinker now
and
>>>>>                     it`s first version will do exactly the
same
>>>>>                     type processing like current dsymutil.
>>>>>
>>>>>                 Yeah, best to keep the behavior the same
through that
>>>>>
>>>>>                     Above scheme could be implemented as a
next
>>>>>                     step and it would result in better size
>>>>>                     reduction(better than current state).
>>>>>
>>>>>                     But I think the better scheme could be
done
>>>>>                     also and it would result in even bigger
size
>>>>>                     reduction and in faster execution. This
scheme
>>>>>                     is something similar to what you`ve
described
>>>>>                     above: "LTO does - making
monolithic types -
>>>>>                     merging all the members from different
>>>>>                     definitions of the same type into
one".
>>>>>
>>>>>                 I believe the reason that's probably
not been done
>>>>>                 is that it can't be streamed - it'd
lead to
>>>>>                 buffering more of the output
>>>>
>>>>                 yes. The fact that DWARF should be streamed
into
>>>>                 AsmPrinter complicates parallel dwarf
generation.
>>>>                 In my prototype, I generate
>>>>                 several resulting files(each for one source
>>>>                 compilation unit) and then sequentially glue
them
>>>>                 into the final resulting file.
>>>>
>>>>             How does that help? Do you use relocations in those
>>>>             intermediate object files so the DWARF in them can
>>>>             refer across files?
>>>
>>>             It does not help with referring across the file. It
>>>             helps to parallel the generation of CU bodies.
>>>             It is not possible to write two CUs in parallel into
>>>             AsmPrinter. To make possible parallel generation I
>>>             stream them into different AsmPrinters(this comment is
>>>             for "I believe the reason that's probably not
been done
>>>             is that it can't be streamed". which initially
was about
>>>             referring across the file, but it seems I added another
>>>             direction).
>>>
>>>         Oh, I see - thanks for explaining, essentially buffering
>>>         on-disk.
>>>
>>>>
>>>>>                 (if two of these expandable types were in
one CU -
>>>>>                 the start of the second type couldn't
be known
>>>>>                 until the end because it might keep getting
pushed
>>>>>                 later due to expansion of the first type)
and/or
>>>>>                 having to revisit all the type references
(the
>>>>>                 offset to the second type wouldn't be
known until
>>>>>                 the end - so writing the offsets to refer
to the
>>>>>                 type would have to be deferred until then).
>>>>
>>>>                 That is the second problem: offsets are not
known
>>>>                 until the end of file.
>>>>                 dsymutil already has that situation for
inter-CU
>>>>                 references, so it has extra pass to
>>>>                 fixup offsets.
>>>>
>>>>             Oh, it does? I figured it was one-pass, and that it
>>>>             only ever refers back to types in previous CUs? So
it
>>>>             doesn't have to go back and do a second pass.
But I
>>>>             guess if sees a declaration of T1 in CU1, then
later on
>>>>             sees a definition of T1 in CU2, does it somehow go
back
>>>>             to CU1 and remove the declaration/make references
refer
>>>>             to the definition in CU2? I figured it'd just
leave the
>>>>             declaration and references to it as-is, then add
the
>>>>             definition and use that from CU2 onwards?
>>>
>>>             For the processing of the types, it do not go back.
>>>             This "I figured it was one-pass, and that it only
ever
>>>             refers back to types in previous CUs"
>>>             and this "I figured it'd just leave the
declaration and
>>>             references to it as-is, then add the definition and use
>>>             that from CU2 onwards" are correct.
>>>
>>>         Great - thanks for explaining/confirming!
>>>
>>>
>>>>                 With multi-thread implementation such situation
>>>>                 would arise more often
>>>>                 for type references and so more offsets should
be
>>>>                 fixed during additional pass.
>>>>
>>>>>                     DWARFLinker could create additional
artificial
>>>>>                     compile unit and put all merged types
there.
>>>>>                     Later patch all type references to
point into
>>>>>                     this additional compilation unit.  No
any bits
>>>>>                     would be duplicated in that case. The
>>>>>                     performance improvement could be
achieved due
>>>>>                     to less amount of the copied DWARF and
due to
>>>>>                     the fact that type references could be
updated
>>>>>                     when DWARF is cloned(no need in
additional
>>>>>                     pass for that).
>>>>>
>>>>>                 "later patch all type references to
point into
>>>>>                 this additional compilation unit" -
that's the
>>>>>                 additional pass that people are probably
>>>>>                 talking/concerned about. Rewalking all the
DWARF.
>>>>>                 The current dsymutil approach, as far as I
know,
>>>>>                 is single pass - it knows the final,
absolute
>>>>>                 offset to the type from the moment it emits
that
>>>>>                 type/needs to refer to it.
>>>>
>>>>                 Right. Current dsymutil approach is single
pass.
>>>>                 And from that point of view, solution
>>>>                 which you`ve described(to produce a declaration
of
>>>>                 the type, with the relevant members)
>>>>                 allows to keep that single pass implementation.
>>>>
>>>>                 But there is a restriction for current dsymutil
>>>>                 approach: To process inter-CU references
>>>>                 it needs to load all DWARF into the
memory(While it
>>>>                 analyzes which part of DWARF is live,
>>>>                 it needs to have all CUs loaded into the
memory).
>>>>
>>>>             All DWARF for a single file (which for dsymutil is
>>>>             mostly a single CU, except with LTO I guess?), not
all
>>>>             DWARF for all inputs in memory at once, yeah?
>>>
>>>             right. In dsymutil case - all DWARF for a single
>>>             file(not all DWARF for all inputs in memory at once).
>>>             But in llvm-dwarfutil case single file contains DWARF
>>>             for all original input object files and it all becomes
>>>             loaded into memory.
>>>
>>>         Yeha, would be great to try to go CU-by-CU.
>>>
>>>>                 That leads to huge memory usage.
>>>>                 It is less important when source is a set of
object
>>>>                 files(like in dsymutil case) and this
>>>>                 become a real problem for llvm-dwarfutil
utility
>>>>                 when source is a single file(With current
>>>>                 implementation it needs 30G of memory for
compiling
>>>>                 clang binary).
>>>>
>>>>             Yeah, that's where I think you'd need a
fixup pass one
>>>>             way or another - because cross-CU references can
mean
>>>>             that when you figure out a new layout for CU5
(because
>>>>             it has a duplicate type definition of something in
CU1)
>>>>             then you might have to touch CU4 that had an
>>>>             absolute/cross-CU forward reference to CU5. Once
you've
>>>>             got such a fixup pass (if dsymutil already has one?
>>>>             Which, like I said, I'm confused why it would
have
>>>>             one/that doesn't match my very vague
understanding)
>>>>             then I think you could make dsymutil work on a
per-CU
>>>>             basis streaming things out, then fixing up a few
offsets.
>>>
>>>             When dsymutil deduplicates types it changes local CU
>>>             reference into inter-CU reference(so that CU2(next)
>>>             could reference type definition from CU1(prev)). To do
>>>             this change it does not need to do any fixups
currently.
>>>
>>>             When dsymutil meets already existed(located in the
input
>>>             object file) inter-CU reference pointing into the CU
>>>             which has not been processed yet(and then its offset is
>>>             unknown) it marks it as "forward reference"
and patches
>>>             later during additional pass "fixup forward
references"
>>>             at a time when offsets are known.
>>>
>>>         OK, so limited 2 pass system. (does it do that second pass
>>>         once at the end of the whole dsymutil run, or at the end of
>>>         each input file? (so if an input file has two CUs and the
>>>         first CU references a type in the second CU - it could
write
>>>         the first CU with a "forward reference", then
write the
>>>         second CU, then fixup the forward reference - and then go
on
>>>         to the next file and its CUs - this could improve
>>>         performance by touching recently used memory/disk pages
>>>         only, rather than going all the way back to the start later
>>>         on when those pages have become cold)
>>
>>         yes, It does it in the end of each input file.
>>
>>
>>>
>>>             If CUs would be processed in parallel their offsets
>>>             would not be known at the moment when local type
>>>             reference would be changed into inter-CU reference. So
>>>             we would need to do the same fix-up processing for all
>>>             references to the types like we already do for other
>>>             inter-CU references.
>>>
>>>         Yeah - though the existence of this second "fixup
forward
>>>         references" system - yeah, could just use it much more
>>>         generally as you say. Not an extra pass, just the existing
>>>         second pass but having way more fixups to fixup in that
pass.
>>         If we would be able to change the algorithm in such way :
>>
>>         1. analyse all CUs.
>>         2. clone all CUs.
>>
>>         Then we could create a merged type table(artificial CU
>>         containing types) during step1.
>>         If that type table would be written first, then all following
>>         CUs could use known offsets
>>         to the types and we would not need additional fix-up
>>         processing for type references.
>>         It would still be necessary to fix-up other inter-CU
>>         references. But it would not be necessary
>>         to fix-up type references (which constitute the vast majority).
>>
>>
>>     To me, that sounds more expensive than the fixup forward
>>     references pass.
>
>     If we would speak about direct comparison then yes loading DWARF
>     one more time looks more expensive than fixup forward references
>     pass. But if we would speak about the general picture then it
>     could probably be beneficial:
>
>     1. merging types would lead to a smaller size of resulting DWARF.
>     This would speed up the process.
>        f.e. If we would switch "odr types deduplication" off in
>     current implementation then it would increase execution time two
>     times. That is because more DWARF should be cloned and written in
>     the result. Implementation of "merging types" would probably
have
>     a similar effect
>        - It would speed-up the overall process. So from one side
>     additional step for loading DWARF would
>        decrease performance but a smaller amount of resulting data
>     would increase performance.
>
>     2. When types would be put in the first CU then we would have a
>     simple strategy for our liveness analysis algorithm: just always
>     keep the first CU in memory. This allows us to speed up our
>     liveness analysis step.
>
>     Anyway, all the above is just an idea for future work. Currently,
>     I am going to implement multithread processing for CUs loaded into
>     memory and having the same type of processing as it currently
>     is(Which assumes that "fixup forward references pass" started
to
>     do more work by fixing types references).
>
>
>>
>>>
>>>>                 Without loading all CU into the memory it would
>>>>                 require two passes solution. First to analyze
>>>>                 which part of DWARF relates to live code and
then
>>>>                 second pass to generate the result.
>>>>
>>>>             Not sure it'd require any more second pass than
a
>>>>             "fixup" pass, which it sounds like
you're saying it
>>>>             already has?
>>>
>>>             It looks like it would need an additional pass to
>>>             process inter-CU references(existed in incoming file)
if
>>>             we do not want to load all CUs into memory.
>>>
>>>         Usually inter-CU references aren't used, except in LTO
- and
>>>         in LTO all the DWARF deduplication and function discarding
>>>         is already done by the IR linker anyway. (ThinLTO is a bit
>>>         different, but really we'd be better off teaching it
the
>>>         extra tricks anyway (some can't be fixed in ThinLTO -
like
>>>         emitting a "Home" definition of an inline
function, only to
>>>         find out other ThinLTO backend/shards managed to optimize
>>>         away all uses of the function... so some cleanup may be
>>>         useful there)). It might be possible to do a more
>>>         dynamic/rolling cache - keep only the CUs with unresolved
>>>         cross-CU references alive and only keep them alive until
>>>         their cross-CU references are found/marked alive. This
>>>         should make things no worse than the traditional dsymutil
>>>         case - since cross-CU references are only
>>>         effective/generally used within a single object file
(it's
>>>         possible to create relocations for them into other files -
>>>         but I know LLVM doesn't currently do this and I
don't think
>>>         GCC does it) with multiple CUs anyway - so at most
you'd
>>>         keep all the CUs from a single original input file alive
>>>         together.
>>         But, since it is a DWARF documented case the tool should be
>>         ready for such case(when inter-CU
>>         references are heavily used).
>>
>>
>>     Sure - but by implementing a CU liveness window like that
>>     (keeping CUs live only so long as they need to be rather than an
>>     all-or-nothing approach) only especially quirky inputs would hit
>>     the worst case while the more normal inputs could perform better.
>
>     It is not clear what should be put in such CU liveness window. If
>     CU100 references CU1 - how could we know that we need to put CU1
>     into CU liveness window before we processed CU100?
>
> Fair point, not just forward references to worry about but backward 
> references too. I wonder how much savings there is in the liveness 
> analysis compared to "keep one copy of everything, no matter whether 
> it's live or not", then it can be a pure forward progress
situation.
> (with the quirk that you might emit a declaration for an entity once, 
> then a definition for it later - alternatively if a declaration is 
> seen it could be skipped under the assumption that a definition will 
> follow (& use a forward ref fixup) - and if none is found, splat some 
> stub declarations into a trailing CU at the end)
>
>
>>         Moreover, llvm-dwarfutil would be the tool producing
>>         exactly such situation. The resulting file(produced by
>>         llvm-dwarfutil) would contain a lot of
>>         inter-CU references. Probably, there is no practical reasons
>>         to apply llvm-dwarfutil to the same
>>         file twice but it would be a good test for the tool.
>>
>>
>>     It'd be a good stress test, but not necessarily something that
>>     would need to perform the best because it wouldn't be a common
>>     use case.
>
>     I agree that we should not slow down the DWARFLinker in common
>     cases only because we need to support the worst cases.
>     But we also need to implement a solution which works in some
>     acceptable manner for the worst case.
>
> I think that depends on "acceptable" - correct, yes. Practical to
run
> in reasonable time/memory? Not necessarily, in my opinion.
>
>     The current solution - loading everything in memory - makes it
>     hard to use in a non-dsymutil scenario(llvm-dwarfutil).
>
> I agree it's worth exploring the non-dsymutil scenario, as you are - 
> I'm just saying we don't necessarily need to support high usability
> (fast/low memory usage/etc) llvm-dwarfutil on an already dwarfutil'd 
> binary (but as you've pointed out, the "window" is unknowable
because
> of backward references, so this whole subthread is perhaps irrelevant).
>
>     There could be several things which could be used to decide
>     whether we need to go on a light or heavy path:
>
>     1. If the input contains only a single CU we do not need to unload
>     it from memory. Thus - we would not need to do an extra DWARF
>     loading pass.
>     2. If abbreviations from the whole input file do not contain
>     inter-CU references then while doing liveness analysis,  we do not
>     need to wait until other CUs are processed.
>
> (2) Yeah, that /may/ be a good idea, cheap to test, etc. Though I'd 
> still wonder if a more general implementation strategy could be found 
> that would make it easier to get a sliding scale of efficiency 
> depending on how much inter-CU references where were, not a "if there 
> are none it's good, if there are any it's bad or otherwise very 
> different to implement".
I think, there is a scenario which would make it possible to process CU 
once for not referenced CUs and handle inter-CU references in a scalable 
way(even for dwarfutil`d binary):

1. Implement a global type's table and types merging. This allows us to 
have all types in the memory.
    Then, all inter-CU type references would point into that memory type 
table.
    (we do not know which CU should be put into CU liveness window, we 
also could not put all CUs into the memory, but we could put all types 
into the memory).

2. If there are not other inter-CU references then all CUs would be 
handled by one pass.

3. If there are other inter-CU references, then after all CU processed 
by the first pass we would have a list of referenced CUs. Then, we could 
delete already cloned data(for referenced CU) and start the process again:
    load CU, mark liveness, clone data. This second pass would be done 
for only referenced CUs.
    For not-complex, not closely coupled cases it would work relatively 
fast.

4. put memory type table into artificial CU. Update all type`s references.

>
>     Then that scheme would be used for worst cases:
>
>     1. for (CU : CU1...CU100) {
>          load CU.
>          analyse CU.
>          unload CU.
>        }
>     2. for (CU : CU1...CU100) {
>          load CU.
>          clone CU.
>          unload CU.
>        }
>     3. fixup forward references.
>
>     and that scheme for light cases:
>
>     1. for (CU : CU1...CU100) {
>          load CU.
>          analyse CU.
>          clone CU.
>          unload CU.
>        }
>     2. fixup forward references.
>
>>         Generally, I think we should not assume that inter-CU
>>         references would be used in a limited way.
>>
>>         Anyway, if this scheme:
>>
>>         1. analyse all CUs.
>>         2. clone all CUs.
>>
>>         would work slow then we would need to continue with one-pass
>>         solution and not support complex closely coupled inputs.
>>
>>
>>     yeah, certainly seeing the data/experiments will be interesting,
>>     if you end up implementing some different strategies, etc.
>>
>>     I guess one possibility for parallel generation could be
>>     something more like Microsoft's approach with a central debug
>>     info server that compilers communicate with - not that exact
>>     model, I mean, but if you've got parallel threads generating
>>     reduced DWARF into separate object files - they could communicate
>>     with a single thread responsible for type emission - the type
>>     emitter would be given types from the separate threads and
>>     compute their size, queue them up to be streamed out to the type
>>     CU (& keep the source CU alive until that work was done) - such
a
>>     central type emitter could quickly determine the size of the type
>>     to be emitted and compute future type offsets (eg: if 5 types
>>     were in the queue, it could've figured out the offset of those
>>     types already) to answer type offset queries quickly and unblock
>>     the parallel threads to continue emitting their CUs containing
>>     type references.
>
>     yes. Thank you. Would think about it.
>
>     Alexey.
>
>>
>>     - Dave
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20201102/d59f28f2/attachment.html>

llvm dev - Nov 2020 - [Proposal][Debuginfo] dsymutil-like tool for ELF.

[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.