thr3ads.net - llvm dev - [llvm-dev] [DWARF] using simplified template names [Jun 2021]


David Blaikie via llvm-dev

2021-Jun-05 01:33 UTC

[llvm-dev] [DWARF] using simplified template names

tl;dr: What if we used only the base name of templates in the DW_AT_name
field for function and class templates (eg: "vector" instead of
"vector<int, std::allocator<int>>")?

Context:
We (at Google) have been seeing some significant DWARF growth in binaries
lately due to increased use of libraries like Eigen and TensorFlow that use
expression templates.

This includes some cases where the debug_str.dwo section has exceeded the
DWARF32 limit (& the binutils dwp tool silently wrote overflowed indexes
into the debug_str_offsets.dwo section, unfortunately - leading to
corrupted/garbled names in backtraces) & most of the growth is from the
demangled names of complicated/large expression templates.
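
To make the failure mode concrete: DWARF32 offsets are 32 bits wide, so a string section larger than 4 GiB cannot be indexed. Below is a minimal sketch of a checked narrowing next to the silent truncation described above. Both `checkedStrOffset` and `truncatedStrOffset` are hypothetical helpers written for illustration; they are not part of binutils or LLVM.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper: narrow a 64-bit section offset to the 32-bit form
// used by DWARF32, reporting failure instead of wrapping around.
std::optional<std::uint32_t> checkedStrOffset(std::uint64_t offset) {
  if (offset > std::numeric_limits<std::uint32_t>::max())
    return std::nullopt; // does not fit: the caller has to diagnose this
  return static_cast<std::uint32_t>(offset);
}

// What a silent truncation does: the high bits are dropped, so the stored
// offset points at an unrelated position in the string section.
std::uint32_t truncatedStrOffset(std::uint64_t offset) {
  return static_cast<std::uint32_t>(offset);
}
```

With the checked form, an offset of 0x100000010 is rejected; with the truncating form it becomes 0x10, which is how a garbled name ends up in a backtrace.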

Options:
One solution would be to move to DWARF64 - though that does make DWARF
overall larger, which is an unfortunate cost that would be nice to avoid.

Another might be to rely solely on linkage names (add linkage names to
types), since mangled names generally reduce a lot of the duplication -
though in some cases it's not a matter of duplication within a single name,
but possibly many distinct types used as template parameters - though those
types may also be used in other names (& mangled names have no sharing
across names).

Compression doesn't help, since the offsets are into the uncompressed data.

Main idea:
What if templates instead only encoded the base name, such as "vector"
(rather than "vector<int, std::allocator<int>>")? The full name could still
be reconstructed from the DW_TAG_template_type_parameters (non-type
template parameters would be more difficult, and we'd need to add template
parameters to template declarations - functionality we have, but is only
enabled for SCE today).
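
A minimal sketch of that reconstruction, assuming a simplified in-memory `Die` struct (hypothetical; not the data structures of LLVM, GDB, or any other consumer). It handles only template type parameters and ignores non-type parameters, qualifiers, and empty parameter packs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified model of a type DIE. A real consumer would read
// these from .debug_info rather than build them by hand.
struct Die {
  std::string name;                  // DW_AT_name, e.g. "vector"
  std::vector<const Die *> typeArgs; // types referenced by the
                                     // DW_TAG_template_type_parameter children
};

// Rebuild a display name such as "vector<int, allocator<int>>" from the
// base name and the template type parameters, recursing into each argument.
std::string fullName(const Die &die) {
  std::string result = die.name;
  if (die.typeArgs.empty())
    return result; // not a template (or no parameters recorded)
  result += '<';
  for (std::size_t i = 0; i < die.typeArgs.size(); ++i) {
    if (i != 0)
      result += ", ";
    result += fullName(*die.typeArgs[i]);
  }
  result += '>';
  return result;
}
```

Because each argument is itself a DIE reference, a type used as an argument in many instantiations is stored once, rather than being spelled out in every name.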

This could significantly reduce debug info size (in some worst-cases I've
seen this lead to a 50% reduction in the uncompressed size of
.debug_str.dwo in a dwp file, for instance - probably less exciting if the
data was compressed - but gives a sense of the headroom available before
this limit will be reached again).

Also has the nice property that it's not a new format or encoding that
might break existing consumers immediately (DWARF64, for instance isn't
widely implemented to my knowledge, so many consumers would need to be
fixed before they could parse any of it) - if a consumer doesn't know,
it'll still see a name, just not the most fully descriptive/specific name
it could be. For a symbolizer this is probably fairly low cost - users
would find a simple template function name harder to work with, but not
totally useless.

As it happens, it seems GDB is already built to cope with this situation -
it can print the real name of the type and can even correctly match up two
distinct type declarations between translation units by correctly matching
their template parameters.

GDB Example:
a.h:
template<typename T>
struct t1 { T t = sizeof(T); };
void f(t1<int> &p1, t1<short> *&p2);

a.cpp:
#include "a.h"
int main() {
  t1<int> v1;
  t1<short> *v2 = nullptr;
  t1<bool> *v3 = nullptr;
  f(v1, v2);
}

b.cpp:
#include "a.h"
void f(t1<int> &p1, t1<short> *&p2) {
  static t1<short> v2;
  p2 = &v2;
}


// using a clang modified to produce simple template names, and
// to include template parameters on declarations
// (-Xclang -debug-forward-template-params)
$ clang++ a.cpp b.cpp -g
$ llvm-dwarfdump a.out (glossing over some details)

DW_TAG_compile_unit
  DW_AT_name    ("a.cpp")
  DW_TAG_structure_type
    DW_AT_name  ("t1")
    DW_TAG_template_type_parameter
      DW_AT_type        (0x00000098 "int")
      DW_AT_name        ("T")
    DW_TAG_member
      DW_AT_name        ("t")
      DW_AT_type        (0x00000098 "int")
  DW_TAG_structure_type
    DW_AT_name  ("t1")
    DW_AT_declaration   (true)
    DW_TAG_template_type_parameter
      DW_AT_type        (0x000000e2 "short")
      DW_AT_name        ("T")
  DW_TAG_structure_type
    DW_AT_name  ("t1")
    DW_AT_declaration   (true)
    DW_TAG_template_type_parameter
      DW_AT_type        (0x000000fd "bool")
      DW_AT_name        ("T")

DW_TAG_compile_unit
  DW_AT_name    ("b.cpp")
  DW_TAG_structure_type
    DW_AT_name  ("t1")
    DW_TAG_template_type_parameter
      DW_AT_type        (0x0000019e "short")
      DW_AT_name        ("T")
    DW_TAG_member
      DW_AT_name        ("t")
      DW_AT_type        (0x0000019e "short")
  DW_TAG_structure_type
    DW_AT_name  ("t1")
    DW_AT_declaration   (true)
    DW_TAG_template_type_parameter
      DW_AT_type        (0x000001b9 "int")
      DW_AT_name        ("T")
$ gdb ./a.out
(gdb) start
(gdb) ptype v1
type = struct t1<int> [with T = int] {
    T t;
}
(gdb) ptype v2
type = struct t1<short> [with T = short] {
    T t;
} *
(gdb) ptype v3
type = struct t1<bool> {
    <incomplete type>
} *
(gdb) ptype v1.t
type = int
(gdb) ptype v2->t
type = short
(gdb) ptype v3->t
There is no member named t.



So in this example we have one instantiation (t1<int>) defined in the
first CU and declared in the second, one instantiation (t1<short>) declared
in the first and defined in the second, and a third instantiation
(t1<bool>) declared in the first and not defined anywhere.

GDB has correctly rendered the type names, despite the template parameter
lists being absent from the DW_AT_name - and has correctly associated the
definitions with the declarations despite the DW_AT_name being ambiguous,
by using the DW_TAG_template_type_parameters.
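
As a rough sketch of the kind of matching involved (this is not GDB's actual implementation; `TypeDie`, `TypeKey`, and `findDefinition` are hypothetical names), a consumer can key each type on its base name plus its template arguments rather than on DW_AT_name alone:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flattened view of a DW_TAG_structure_type DIE.
struct TypeDie {
  std::string baseName;              // DW_AT_name, e.g. "t1"
  std::vector<std::string> typeArgs; // names of the template type arguments
  bool isDeclaration;                // DW_AT_declaration
};

// The base name alone is ambiguous, so the key also carries the arguments.
using TypeKey = std::pair<std::string, std::vector<std::string>>;

// Index the definitions by (base name, template arguments), then resolve a
// declaration by looking up the same key. Returns nullptr if no CU provides
// a definition.
const TypeDie *findDefinition(const std::vector<TypeDie> &allTypes,
                              const TypeDie &decl) {
  std::map<TypeKey, const TypeDie *> definitions;
  for (const TypeDie &die : allTypes)
    if (!die.isDeclaration)
      definitions.emplace(TypeKey{die.baseName, die.typeArgs}, &die);
  auto it = definitions.find(TypeKey{decl.baseName, decl.typeArgs});
  return it == definitions.end() ? nullptr : it->second;
}
```

Fed the five t1 DIEs from the dump above, this resolves the t1<short> declaration in a.cpp to the definition in b.cpp, the t1<int> declaration in b.cpp to the definition in a.cpp, and finds nothing for t1<bool>.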

lldb doesn't cope with this sort of DWARF currently - it has a bunch of
assumptions about the names of template instantiations that'll need to be
fixed before it can consume this sort of thing.

I haven't tested a wide number of symbolizers, but I assume they'll
generally need some work too.

So... how's this sound to everyone? An idea worth pursuing?
Concerns/questions/etc.

I don't expect this to become the default for LLVM in the short term at
least - but under a flag for those whose consumers can handle it (/maybe/
we do it under debugger tuning for gdb, since it seems OK with it - but
that might be a bit stronger than we want to do under the default tuning,
since it's really broken for lldb, not just a little bit broken).

- Dave

Paul Robinson via llvm-dev

2021-Jun-07 22:00 UTC

[llvm-dev] [DWARF] using simplified template names

Fully expanded names of template instantiations can become impressively large,
yeah.

The DWARF Wiki’s Best Practices page
http://wiki.dwarfstd.org/index.php?title=Best_Practices recommends including a
canonical form of the template parameters in the DW_AT_name attribute.  I don’t
know that I agree; it talks about omitting qualifiers (namespaces, containing
classes) because those can be reconstructed from the DIE hierarchy, but the same
argument can be made for template parameters (the difference being that
qualifiers come from higher up the tree, while template parameters come from
farther down).  The DRY principle would seem to apply here.
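
For comparison, the qualifier half of that argument can be sketched as a walk up the tree. This assumes a hypothetical `ScopeDie` struct with a parent pointer, which is an illustration only and not any debugger's real representation:

```cpp
#include <string>

// Hypothetical model of a DIE that knows its enclosing scope.
struct ScopeDie {
  std::string name;       // DW_AT_name of a namespace, class, or type
  const ScopeDie *parent; // enclosing DIE, or nullptr at CU level
};

// Qualifiers are not stored in DW_AT_name; they are recovered by walking
// from the DIE up through its parents.
std::string qualifiedName(const ScopeDie &die) {
  if (die.parent == nullptr)
    return die.name;
  return qualifiedName(*die.parent) + "::" + die.name;
}
```

Template parameters would be recovered the same way, except that the walk goes down into the children instead of up into the parents.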

I’ll verify with our debugger team, but I’m confident that dropping the
<params> from the type name will not affect Sony, as our debugger looks at
the template-parameter children already (that’s why we have that turned on by
default for sce tuning).  LLDB seems to be the odd debugger out, here, and we
have some control over that. 😊

Oh, is there any consequence for deduplication in LTO?  Isn’t that name-based?
--paulr

From: David Blaikie <dblaikie at gmail.com>
Sent: Friday, June 4, 2021 9:33 PM
To: llvm-dev <llvm-dev at lists.llvm.org>; Robinson, Paul
<paul.robinson at sony.com>; Adrian Prantl <aprantl at apple.com>;
Jonas Devlieghere <jdevlieghere at apple.com>; Henderson, James
<James.Henderson at sony.com>; Caroline Tice <cmtice at google.com>;
Eric Christopher <echristo at gmail.com>
Subject: [DWARF] using simplified template names


Adrian Prantl via llvm-dev

2021-Jun-07 23:29 UTC

[llvm-dev] [DWARF] using simplified template names

> On Jun 4, 2021, at 6:33 PM, David Blaikie <dblaikie at gmail.com> wrote:
> 
> Another might be to rely solely on linkage names (add linkage names to
> types), since mangled names generally reduce a lot of the duplication - though
> in some cases it's not a matter of duplication within a single name, but
> possibly many distinct types used as template parameters - though those types
> may also be used in other names (& mangled names have no sharing across
> names).

Without having measured this, I find it plausible to believe that the DWARF DIE
tree together with base names can be more compact than linkage names (=mangled
type names) on every DIE because of the sharing within more complex types.

>
> lldb doesn't cope with this sort of DWARF currently - it has a bunch of
> assumptions about the names of template instantiations that'll need to be
> fixed before it can consume this sort of thing.

I'm pretty sure this will break at least some workflows in LLDB, but perhaps not
necessarily the most useful ones. LLDB will search types by name in many
situations, but the fact that template types can be formatted in many different
ways and may contain whitespace makes this process brittle already. In order to
support currently supported workflows we may need to implement a type lookup
where we strip out everything but the basename in the searched type, then do a
by-(base)name lookup, and then filter for template arguments. From afar this
sounds doable, but we should make sure not to enable this debug info
optimization without qualifying it in LLDB first.
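
A rough sketch of that three-step lookup, under loud assumptions: `Candidate`, `stripTemplateArgs`, `removeSpaces`, and `lookup` are hypothetical names and none of this is LLDB code. The index is a plain vector scanned linearly, the base-name stripping does not handle names such as `operator<` or nested qualified templates, and whitespace is normalized crudely by deleting it.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

struct Candidate {
  std::string baseName; // DW_AT_name under the proposal, e.g. "t1"
  std::string fullName; // name rebuilt from the template parameter DIEs
};

// Step 1: strip everything but the base name, e.g. "t1< short >" -> "t1".
std::string stripTemplateArgs(const std::string &name) {
  std::string::size_type open = name.find('<');
  return open == std::string::npos ? name : name.substr(0, open);
}

// Crude normalization so "t1< short >" and "t1<short>" compare equal.
std::string removeSpaces(std::string s) {
  s.erase(std::remove_if(s.begin(), s.end(),
                         [](unsigned char c) { return std::isspace(c) != 0; }),
          s.end());
  return s;
}

// Step 2: look up by base name. Step 3: filter on the template arguments.
std::vector<const Candidate *> lookup(const std::string &query,
                                      const std::vector<Candidate> &index) {
  const std::string base = stripTemplateArgs(query);
  const std::string wanted = removeSpaces(query);
  std::vector<const Candidate *> matches;
  for (const Candidate &c : index)
    if (c.baseName == base && removeSpaces(c.fullName) == wanted)
      matches.push_back(&c);
  return matches;
}
```

A query for "t1< short >" against an index holding t1<int> and t1<short> would return only the second entry, regardless of how the user spaced the arguments.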

-- adrian
