David Blaikie via llvm-dev

2021-Jul-02 20:59 UTC

[llvm-dev] DWARF: Reconstituting mangled names (& skipping DW_AT_linkage_name)

On Thu, Jul 1, 2021 at 8:22 PM Reid Kleckner <rnk at google.com> wrote:
> It could work, but the long linkage names will still be present in
> .strtab, so I wonder if it would make more sense to pursue a solution that
> addresses both issues. I happen to know you were considering a separate
> proposal for that, and I wonder if it could be used to solve this problem
> as well. Either way, the debug info consumer must be taught to look up or
> reconstitute the long mangled name.
>
True.

(for everyone else's context: I've been tossing around the idea for a while
to have an option to use hashed names instead of mangled names for object
symbols (actually I'm starting to consider maybe generalizing this to an
entire floating ABI - if you can guarantee all the C++ is being compiled
with the same clang version - it can arbitrarily pick ABI, symbol names,
etc, that only have to agree with itself - not with some other version used
to compile some precompiled library, etc) - though we'd still want to
preserve the mangled names maybe heaped together in a compressed section,
so that the linker could provide human-actionable diagnostics to the user
in the event of linker errors)

Though I worry that even some way to reference strings in that compressed
blob would take up space we could be saving & the time/space tradeoff might
not be worthwhile. Referencing (rather than reconstituting) would have the
advantage that there would be no risk of incorrect reconstitution, which
would be nice - but could be limiting. (for instance - we might at some
point want to support links with the symbol names omitted in some modes
where linker errors are especially unlikely (continuous integration, etc) -
then repeat the link with the symbol names added to get good diagnostics -
though I suppose in many cases like that we wouldn't want debug info
either... but maybe sometimes, etc)

> I was thinking something like, "if symbol name is longer than X threshold,
> replace it with _H${contenthash}, place the long name in a side table
> section". Tools that are aware of the new convention can do the lookup in
> the side table. Tools that are unaware will just produce funny names. The
> DWARF linkage name would use the _H symbol, and consumers that care beyond
> just having a unique linkage identifier can do the lookup.
>
Yeah, with DWARF we'd probably make something a bit more explicit - a new
DW_FORM, or new attribute name - though I guess there's some benefit to
producing the unique name that everyone can use even if it's not very
legible.
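
As a rough illustration of the threshold-and-hash scheme quoted above, here
is a minimal Python sketch. Everything in it is an assumption made for the
example: the threshold value, the use of MD5 as the content hash, the
function names, and a plain dict standing in for the side table section.
The thread does not settle any of those details.

```python
import hashlib

# Hypothetical value for the "X threshold"; the thread does not pick one.
THRESHOLD = 64


def shorten_symbol(name, side_table, threshold=THRESHOLD):
    """Return the symbol to emit, swapping long names for _H<hash>."""
    if len(name) <= threshold:
        return name
    # MD5 is only one possible content hash; it is an assumption here.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    short = "_H" + digest
    # The dict stands in for the proposed side table section.
    side_table[short] = name
    return short


def original_name(symbol, side_table):
    """What an aware tool would do: look the hashed symbol up again."""
    return side_table.get(symbol, symbol)
```

A tool that is unaware of the convention would simply show the _H name; an
aware one would call something like original_name() before printing.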

Yeah, if I reframe this in my head: What if we fixed the ELF symbol name
length problems (by using such a hash scheme) - would the remaining DWARF
size cost be worth the complexity of reconstitution & risk of incorrect
reconstitution? Maybe not.

Though perhaps there are folks who might be interested in the reconstitution
savings when they can't change their ABI? In that case it'd be pretty
misleading to include an incorrect value for the mangled name in the
DW_AT_linkage_name attribute. We could introduce a different attribute for
it in that case.

(I guess if we used references to this shared "real linkage name section" -
there wouldn't be an issue with stripped binaries: If you stripped out the
linkage name section you probably stripped out the debug info sections too
so there wouldn't be anything left to debug/reference the stripped linkage
names)

Alternatively: If we did this reconstituted linkage name thing, the hashed
symbols ELF feature could potentially skip the linkage names when there's
debug info present and rely on reconstituting the names...

In summary: I've mixed thoughts on this.

- Dave

>
> There is prior art for this. MSVC caps linkage names at 4096, I believe,
> and hashes the name down with MD5:
>
>
> https://github.com/llvm/llvm-project/blob/main/clang/lib/AST/MicrosoftMangle.cpp#L53
>
> On Thu, Jun 24, 2021 at 5:32 PM David Blaikie via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> In addition to simplifying template names (
>> https://groups.google.com/g/llvm-dev/c/ekLMllbLIZg ) another case I've
>> found in my use case is a lot of mangled names (in part because we build
>> with -fdebug-info-for-profiling which turns on function linkage names
>> even at -g1/-gmlt).
>>
>> So I was wondering if we could recreate linkage names from DWARF, rather
>> than encoding them directly - and I have a prototype that seems to show
>> this is possible (at least some simple cases - including some template
>> cases).
>>
>> In the pathological case I'm looking at (lots of expression templates in
>> TensorFlow) skipping linkage names in the cases I think we can
>> reconstitute (but I haven't implemented the full logic and verified
>> everything can be reconstituted) reduced .debug_str.dwo by 52% (and that
>> composes/stacks with the 43% reduction from the simplified template
>> names - for a 95% reduction in total) and in a large but less
>> pathological binary it was 56% (in addition to 25% from the template
>> names, still 80% reduction overall).
>>
>> Wondering if anyone's interested in this? Has
>> thoughts/feelings/concerns/etc?
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>

Greg Clayton via llvm-dev

2021-Aug-23 22:15 UTC

[llvm-dev] DWARF: Reconstituting mangled names (& skipping DW_AT_linkage_name)

The idea of encoding names more efficiently is a great one. I would have no
concerns if the following were true:
- we could 100% always reconstruct linkage names if we need to
- accelerator tables that are trusted by debuggers (.debug_names, or
  .apple_XXX) that used to contain linkage names still do after this change

The main reason for this is the LLDB expression parser. When the expression
parser needs to call a function, the interface we have with the JIT code in LLVM
means we always look up functions by linkage (mangled) name. So if the
accelerator tables don't have the mangled names inside of them, we will need
to know how/when we would need to ignore the accelerator tables and manually
index the DWARF each time you debug. Right now LLDB and GDB don't trust
.debug_pubnames or .debug_pubtypes because they don't index everything.
.debug_names has stricter rules on what needs to be included, so any solution
should make sure the contents of this section do not differ between a binary
compiled with this new feature and one compiled without it.
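
To make the lookup concern concrete, here is a small Python sketch of the
fallback a debugger would need if the accelerator table stopped carrying
mangled names. It is not LLDB's API: the function names, the dict used as an
accelerator table, and the reconstruct callback are all hypothetical
stand-ins for this example.

```python
def build_manual_index(dies, reconstruct):
    """Index every DIE by its reconstituted linkage name (the slow path)."""
    return {reconstruct(die): die for die in dies}


def find_function(linkage_name, accel_table, dies, reconstruct):
    """Look a function up by mangled name, as the expression parser does.

    accel_table is a dict standing in for .debug_names; dies is every
    function DIE in the binary; reconstruct maps a DIE to its mangled name.
    """
    # Fast path: the accelerator table still contains the linkage name.
    if linkage_name in accel_table:
        return accel_table[linkage_name]
    # Slow path: the name was omitted, so all of the DWARF must be indexed.
    return build_manual_index(dies, reconstruct).get(linkage_name)
```

The slow path touches every DIE, which is the per-session cost of manually
indexing the DWARF described above.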

I like the idea of being able to refer to a string from the main string table of
the object file (.strtab for ELF, or LC_SYMTAB in Mach-O) if it already exists
there. It would be interesting to compare the symbols that are in both
.debug_str and .symtab from one of these large C++ binaries just to see how much
space we could save if we had a new form, DW_FORM_symtab_str, that could refer
to this section.
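
The comparison suggested here could be sketched roughly as follows in
Python. The sketch assumes the two string lists have already been extracted
from the binary by some other tool, that .debug_str holds each string once,
and that a reference through the hypothetical DW_FORM_symtab_str would be
the same size as the existing string offset, so the only saving is the
string bytes themselves. The function name is made up for the example.

```python
def shared_string_savings(debug_str_strings, symtab_names):
    """Return (shared strings, bytes removable from .debug_str).

    Both arguments are iterables of already-extracted strings; getting
    them out of a real binary is outside this sketch.
    """
    shared = set(debug_str_strings) & set(symtab_names)
    # Each string occupies its bytes plus a NUL terminator in the table.
    saved = sum(len(s.encode("utf-8")) + 1 for s in shared)
    return shared, saved
```

Only strings that match exactly are counted, so short names such as "print"
that never appear as symbols contribute nothing to the estimate.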

Another idea would be to have a new attribute that relies on the parent DIE
chain, where each child would encode its partial mangled name. Something
like DW_AT_linkage_prefix and/or DW_AT_linkage_suffix. Then you could traverse
the parent DIEs to reconstruct the full linkage name.

So if we have:

namespace foo {
  class bar {
    void print(const char *format) const;
  };
}

The DWARF could be something like:

DW_TAG_namespace
DW_AT_name("foo")
DW_AT_linkage_prefix("_Z3foo")

  DW_TAG_class_type
  DW_AT_name("bar")
  DW_AT_linkage_prefix("3bar")  

    DW_TAG_subprogram
    DW_AT_name("print")
    DW_AT_linkage_prefix("5print")
    DW_AT_linkage_suffix(" const")
      
      DW_TAG_formal_parameter
      DW_AT_name("format")
      DW_AT_linkage_prefix("PKc")
    
This might allow a lot more name sharing between templated functions since their
function base names like "erase", "begin", "end" and many more could be shared
in the string tables.
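
The parent-chain traversal could look roughly like this in Python. The Die
class, the function name, and the concatenation order (ancestor prefixes
outermost first, then parameter fragments, then suffixes innermost first)
are assumptions for the sketch, since the attributes are only a suggestion
and no exact rule is given. The parameter fragment is written as "PKc", the
Itanium encoding of const char *.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Die:
    """Toy DIE carrying the hypothetical prefix/suffix attributes."""
    tag: str
    name: str
    linkage_prefix: str = ""
    linkage_suffix: str = ""
    parent: Optional["Die"] = field(default=None, repr=False, compare=False)
    children: List["Die"] = field(default_factory=list)

    def add(self, child: "Die") -> "Die":
        child.parent = self
        self.children.append(child)
        return child


def reconstruct_linkage_name(die: Die) -> str:
    # Collect the chain from the outermost ancestor down to this DIE.
    chain = []
    node: Optional[Die] = die
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()
    prefix = "".join(d.linkage_prefix for d in chain)
    # Parameter fragments come from the DIE's own children, in order.
    params = "".join(c.linkage_prefix for c in die.children
                     if c.tag == "DW_TAG_formal_parameter")
    # Suffixes are applied innermost first.
    suffix = "".join(d.linkage_suffix for d in reversed(chain))
    return prefix + params + suffix
```

Built from the fragments in the example, this yields "_Z3foo3bar5printPKc
const". That is not what the Itanium ABI produces for this function
(_ZNK3foo3bar5printEPKc), so a real encoding would also need fragments for
the nested-name markers and qualifiers.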


