thr3ads.net - llvm dev - [llvm-dev] [RFC] - Deduplication of debug information in linkers (LLD). [Dec 2017]

George Rimar via llvm-dev

2017-Dec-07 12:47 UTC

[llvm-dev] [RFC] - Deduplication of debug information in linkers (LLD).

>*nod* That's been the historic ELF+DWARF approach, but both MacOS (with
>dsyms+DWARF) and Windows (COFF+CodeView+PDB) don't do it that way, and
>instead involve the linker to a degree.
>Mostly I'm wondering if it'd be reasonable to (and if anyone would be
>interested in doing it) do something more like the PDB support - fully
>debug-aware linking.

Honestly, I only know how the ELF linker works, so my thoughts below may be
silly for some reason or may duplicate some already existing approach.
Looking at what .dwp does, it looks like there are two main things that
reduce the size of the debug data:
1) "It must allow for the removal of duplicate type units".
2) "It must allow for the removal of duplicate strings".

The linker already deduplicates strings by itself, though it could delegate
that to some API for debug sections.
What it could probably do is call some library API. The linker could give it
some set (or all) of the .debug_* sections, so that this library would
rebuild and optimize the DWARF data, eliminate duplicates, and return the
optimized debug sections back to the linker. The linker would then perform
relocations and emit the result to the output.

That way the library could probably also be used for a standalone
post-processing tool, and the linker would be able to work with the data at
the section level only, without being DWARF aware.
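
As a rough illustration of the string-deduplication half of this idea, here
is a minimal sketch in Python. It is not LLD code: the function name
merge_string_tables and the returned offset maps are hypothetical, and it
assumes each input is a table of NUL-terminated strings, as in .debug_str.

```python
def merge_string_tables(tables):
    """Merge NUL-terminated string tables, storing each distinct string once.

    Returns the merged table plus, for every input table, a dict that maps
    each old string offset to its offset in the merged table.
    """
    merged = bytearray()
    offset_of = {}  # string bytes -> offset in the merged table
    remaps = []
    for table in tables:
        remap = {}
        pos = 0
        while pos < len(table):
            # Raises ValueError if the table is not NUL-terminated.
            end = table.index(b"\0", pos)
            s = bytes(table[pos:end])
            if s not in offset_of:
                offset_of[s] = len(merged)
                merged += s + b"\0"
            remap[pos] = offset_of[s]
            pos = end + 1
        remaps.append(remap)
    return bytes(merged), remaps
```

The per-input offset maps are what a caller would need in order to rewrite
references into the string section after merging.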
>Sure - but it works/is supported/is implemented. If someone wants to
>implement the newer thing, that's cool, but I don't have any personal
>motivation to do so for example. (& honestly we've been throwing around
>some ideas about how to further generalize the debug_info contributions to
>reduce some of the overhead of isolating types - so maybe if we're lazy
>enough, we might leapfrog this particular state and just implement that
>future better thing)

I see. Based on all the comments in this thread, I am inclined to agree that
implementing the newer thing does not make much sense at the moment.
For now I have prepared a patch (D40950) that makes LLD error out when it
faces objects with multiple .debug_* sections in the cases where we do not
support that. (LLD supports deduplicating COMDATs, so in general such an
object is not a problem, but for error reporting purposes and for
--gdb-index we assume that debug sections are unique within an object, so in
those cases it looks like we want to error out.)

I have one last thought/question about this, though :)

Currently clang -gdwarf-5 -fdebug-types-section works, and so the linker can
deduplicate types. That probably violates the specification, which says
there are no more .debug_types sections, but the behavior is convenient for
users of -fdebug-types-section.
I do not know how the transition from v4 to v5 will happen (or how
transitions between DWARF standards usually happen). I suppose one day clang
will just start to produce v5 debug data by default.
At the same time, multiple .debug_info sections are mentioned in the DWARF5
spec as an optimization, so it should not be a mandatory thing to implement.
If so, it seems that either we will need to implement this optimization
before switching to v5 by default, or allow -gdwarf-5 -fdebug-types-section
to support the existing use case. And since it already works and is already
allowed in releases, that probably means it is acceptable to keep (and use)
this behavior? (If so, attempting to leapfrog could be a nice strategy IMO.)

>>>>I think Paul covered some of the reasons type units might not be a
>>>>reasonable default.
>>>
>>>One additional reason is that if you use Split DWARF (another great way
>>>to massively reduce the amount of debug info going to the linker) type
>>>units are mostly /just/ overhead in the .dwo files: since the debug info
>>>is not linked, there's no opportunity to remove the duplication anyway
>>>(unless you're making a DWP - like a dsym file)
>>
>>Yeah. It looks like -gsplit-dwarf and -fdebug-types-section are harmful
>>together. It is probably worth restricting their combined use or emitting
>>a warning (both clang and gcc silently allow the combination, and the
>>output has the size penalty you describe).
>
>Nah, only if you're not producing a DWP at the end
>( https://gcc.gnu.org/wiki/DebugFissionDWP ).

Sure, DWP seems to do a great job here, but even for the DWP flow it does
not seem to make sense to force the compiler to do the extra work of
producing type sections, because I think DWP-producing tools should get no
benefit at all from larger .dwo files with .debug_types.

I can only imagine that somebody might use -gsplit-dwarf and
-fdebug-types-section together in order to parse .debug_types.dwo instead of
.debug_info.dwo, to look up types in a slightly more convenient way, but
that looks like too synthetic a case.

>In short, I probably wouldn't change any of LLVM's defaults. But there are
>certainly flags people can use to reduce their debug info size.
>
>You mentioned starting with this because LLVM's defaults mean the DWARF is
>too large to link with DWARF 32 bit? How does gold cope with this?
>I haven't seen failures/error messages/etc from either gold or lld related
>to this? (though I mostly use Split DWARF myself)

I posted some results earlier here:
https://bugs.llvm.org//show_bug.cgi?id=31109#c3
In short: gold 2.26.1 silently ignored this (and probably produced broken
output), while newer versions of gold are able to catch and report the same
error.

I think it is simply still not common to have such large debug sections; we
have had only a single bug about this so far. Hopefully DWARF64 can be a
solution, though it may just hide the issue; it would still be nice to
reduce the amount of debug data we produce.

Best regards,
George | Developer | Access Softek, Inc

George Rimar via llvm-dev

2017-Dec-07 14:56 UTC

>The linker already deduplicates strings by itself, though it could delegate
>that to some API for debug sections.
>What it could probably do is call some library API. The linker could give
>it some set (or all) of the .debug_* sections, so that this library would
>rebuild and optimize the DWARF data, eliminate duplicates, and return the
>optimized debug sections back to the linker. The linker would then perform
>relocations and emit the result to the output.

Though resolving relocations could probably be a problem here. Maybe the
linker could pass already-relocated sections for the final
optimization/deduplication, probably along with some additional information,
but anyway I see now that it may not be that simple :)

George.

David Blaikie via llvm-dev

2017-Dec-07 20:20 UTC

On Thu, Dec 7, 2017 at 4:47 AM George Rimar <grimar at accesssoftek.com> wrote:
> Honestly, I only know how the ELF linker works, so my thoughts below may
> be silly for some reason or may duplicate some already existing approach.
> Looking at what .dwp does, it looks like there are two main things that
> reduce the size of the debug data:
> 1) "It must allow for the removal of duplicate type units".
> 2) "It must allow for the removal of duplicate strings".
>
Yeah, DWPs are mostly the same as a linker linking debug info without
knowing anything about it. Except instead of relocations, it uses the
cu/tu_index section (& str_index section). Otherwise the DWP packaging tool
doesn't know anything about the debug info (it doesn't need to parse many
DIEs, etc).

This is still simple/coarse grained compared to Windows PDBs or MacOS dsyms.


> The linker already deduplicates strings by itself, though it could
> delegate that to some API for debug sections.
> What it could probably do is call some library API. The linker could give
> it some set (or all) of the .debug_* sections, so that this library would
> rebuild and optimize the DWARF data, eliminate duplicates, and return the
> optimized debug sections back to the linker. The linker would then perform
> relocations and emit the result to the output.
>
> That way the library could probably also be used for a standalone
> post-processing tool, and the linker would be able to work with the data
> at the section level only, without being DWARF aware.
>
Postprocessing (ie: running a tool on the fully linked binary with the
debug info we have today, and having the tool reprocess the debug info to
make it more compact) is an option, but wouldn't help address the problem
you started with - that the output can't fit the large offsets, so the
output is invalid/broken. So that output would be broken before the
postprocessing step could run to compact things.
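
To make the offset problem concrete, here is a small Python sketch of the
kind of check a linker would have to make before writing its output. It is a
simplification, not gold or LLD code: the function name and the input shape
are hypothetical, and it only assumes that 32-bit DWARF stores section
offsets in 4 bytes, so offsets at or beyond 2**32 cannot be represented.

```python
DWARF32_ADDRESSABLE_BYTES = 2**32  # a 4-byte offset can reach bytes 0 .. 2**32 - 1


def sections_overflowing_dwarf32(contribution_sizes):
    """Return the names of output sections too large for 32-bit offsets.

    contribution_sizes maps an output section name to the sizes in bytes of
    every input contribution that is concatenated into that section.
    """
    return sorted(
        name
        for name, sizes in contribution_sizes.items()
        if sum(sizes) > DWARF32_ADDRESSABLE_BYTES
    )
```

Because the total is only known once all inputs are laid out, the overflow
shows up at link time, which is why a tool that runs on the already-linked
binary comes too late.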

> Sure, DWP seems to do a great job here, but even for the DWP flow it does
> not seem to make sense to force the compiler to do the extra work of
> producing type sections, because I think DWP-producing tools should get no
> benefit at all from larger .dwo files with .debug_types.
>
The current DWP tools (one in binutils, one in LLVM) don't do DWARF-aware
debug info compaction. They just concatenate the sections together,
deduplicate strings, deduplicate type units.

So, yes, to have a smaller DWP file in the end it's beneficial to use type
units (be they in debug_types or debug_info).

But a fancier DWP tool that would process all the DWARF and compact the
result wouldn't need explicit type units & could avoid that overhead.
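
The "concatenate and deduplicate type units" behavior described above can be
sketched in a few lines of Python. This is only an illustration, not the
binutils or LLVM dwp implementation: the function name and the
(signature, bytes) input shape are hypothetical, and the only DWARF fact it
relies on is that each type unit carries a 64-bit type signature.

```python
def package_type_units(inputs):
    """Concatenate type units from several .dwo inputs, dropping duplicates.

    inputs is a list with one entry per .dwo file; each entry is a list of
    (signature, unit_bytes) pairs. The first unit seen for a signature wins.
    Returns the packed bytes and an index: signature -> (offset, size).
    """
    packed = bytearray()
    index = {}
    for units in inputs:
        for signature, data in units:
            if signature in index:
                continue  # same signature already packaged: skip the duplicate
            index[signature] = (len(packed), len(data))
            packed += data
    return bytes(packed), index
```

Note that the unit contents are never parsed, which matches the point that
the packaging tool does not need to understand the DIEs. The returned index
plays a role loosely similar to the tu_index section mentioned earlier.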

Eric Christopher via llvm-dev

2017-Dec-08 03:30 UTC

On Thu, Dec 7, 2017 at 12:20 PM David Blaikie <dblaikie at gmail.com> wrote:
> The current DWP tools (one in binutils, one in LLVM) don't do DWARF-aware
> debug info compaction. They just concatenate the sections together,
> deduplicate strings, deduplicate type units.
>
> So, yes, to have a smaller DWP file in the end it's beneficial to use type
> units (be they in debug_types or debug_info).
>
> But a fancier DWP tool that would process all the DWARF and compact the
> result wouldn't need explicit type units & could avoid that overhead.
>
Prior art is "dwz" written by Jakub Jelinek :)

-eric


George Rimar via llvm-dev

2017-Dec-08 09:09 UTC

>Postprocessing (ie: running a tool on the fully linked binary with the
>debug info we have today, and having the tool reprocess the debug info to
>make it more compact) is an option, but wouldn't help address the problem
>you started with - that the output can't fit the large offsets, so the
>output is invalid/broken. So that output would be broken before the
>postprocessing step could run to compact things.

Right. So then it could be some API that takes the .debug_* sections from
the linker, along with the relocations and additional information such as
which sections were GCed/ICFed. It could rebuild the debug data, rebuild the
relocations and return them to the linker, which could then take the
deduplicated debug info, apply the updated relocations and produce the
output.

Honestly, that does not feel nice. It definitely seems easier to do all of
that on the linker side instead.


George.

