Reid Kleckner via llvm-dev
2021-Jul-02 03:21 UTC
[llvm-dev] DWARF: Reconstituting mangled names (& skipping DW_AT_linkage_name)
It could work, but the long linkage names will still be present in .strtab, so I wonder if it would make more sense to pursue a solution that addresses both issues. I happen to know you were considering a separate proposal for that, and I wonder if it could be used to solve this problem as well. Either way, the debug info consumer must be taught to look up or reconstitute the long mangled name.

I was thinking something like, "if symbol name is longer than X threshold, replace it with _H${contenthash}, place the long name in a side table section". Tools that are aware of the new convention can do the lookup in the side table. Tools that are unaware will just produce funny names. The DWARF linkage name would use the _H symbol, and consumers that care beyond just having a unique linkage identifier can do the lookup.

There is prior art for this. MSVC caps linkage names at 4096, I believe, and hashes the name down with MD5:
https://github.com/llvm/llvm-project/blob/main/clang/lib/AST/MicrosoftMangle.cpp#L53

On Thu, Jun 24, 2021 at 5:32 PM David Blaikie via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> In addition to simplifying template names (https://groups.google.com/g/llvm-dev/c/ekLMllbLIZg) another case I've found in my use case is a lot of mangled names (in part because we build with -fdebug-info-for-profiling which turns on function linkage names even at -g1/-gmlt).
>
> So I was wondering if we could recreate linkage names from DWARF, rather than encoding them directly - and I have a prototype that seems to show this is possible (at least some simple cases - including some template cases).
>
> In the pathological case I'm looking at (lots of expression templates in TensorFlow) skipping linkage names in the cases I think we can reconstitute (but I haven't implemented the full logic and verified everything can be reconstituted) reduced .debug_str.dwo by 52% (and that composes/stacks with the 43% reduction from the simplified template names - for a 95% reduction in total) and in a large but less pathological binary it was 56% (in addition to 25% from the template names, still 80% reduction overall).
>
> Wondering if anyone's interested in this? Has thoughts/feelings/concerns/etc?
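To make the "_H${contenthash}" idea above concrete, here is a minimal Python sketch; it is not from the thread. The threshold value, the function names, and the use of a plain dictionary in place of a side table section are all assumptions for illustration. MD5 is used only because it mirrors the MSVC prior art mentioned above; the proposal itself does not name a hash.

```python
import hashlib
from typing import Dict

# Hypothetical cutoff: the thread leaves the threshold "X" unspecified.
THRESHOLD = 64


def shorten_symbol(name: str, side_table: Dict[str, str],
                   threshold: int = THRESHOLD) -> str:
    """Return the name to emit as the symbol and the DWARF linkage name.

    Names longer than the threshold are replaced by "_H" plus a content
    hash, and the original is recorded in the side table.
    """
    if len(name) <= threshold:
        return name
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    short = "_H" + digest
    side_table[short] = name  # stands in for the side table section
    return short


def resolve_symbol(symbol: str, side_table: Dict[str, str]) -> str:
    """What an aware tool would do: map a hashed symbol back to the long name.

    A tool that is unaware of the convention would simply show `symbol`.
    """
    return side_table.get(symbol, symbol)


side_table: Dict[str, str] = {}
long_name = "_Z" + "N3fooE" * 50          # an artificially long mangled-looking name
short = shorten_symbol(long_name, side_table)
print(short, "->", len(resolve_symbol(short, side_table)), "characters")
```

Short names pass through untouched, so only the pathological cases pay for the extra lookup.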
Sterling Augustine via llvm-dev
2021-Jul-02 04:23 UTC
[llvm-dev] DWARF: Reconstituting mangled names (& skipping DW_AT_linkage_name)
One possibility is to make the reference to the linkage name an indirection into .strtab proper rather than .debug_str. There are issues with stripping and such when that is done, but then you only have one copy between the two uses.
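As a rough illustration of the single-copy idea above, the following Python sketch (not from the thread) models a string table as NUL-terminated strings addressed by byte offset. The `StringTable` class and its methods are hypothetical stand-ins, not an ELF or DWARF API; the point is only that two consumers sharing one table store the long name once.

```python
class StringTable:
    """Toy model of an ELF-style string table: NUL-terminated strings,
    referenced by byte offset. Offset 0 holds the empty string."""

    def __init__(self) -> None:
        self.data = bytearray(b"\0")
        self.offsets = {}

    def intern(self, s: str) -> int:
        """Add `s` if it is not present yet and return its offset."""
        if s not in self.offsets:
            self.offsets[s] = len(self.data)
            self.data += s.encode("utf-8") + b"\0"
        return self.offsets[s]

    def get(self, offset: int) -> str:
        """Read the NUL-terminated string starting at `offset`."""
        end = self.data.index(b"\0", offset)
        return self.data[offset:end].decode("utf-8")


name = "_Z" + "N3fooE" * 50

# Separate tables: the symbol table and the debug info each carry a copy.
strtab, debug_str = StringTable(), StringTable()
strtab.intern(name)
debug_str.intern(name)
separate_size = len(strtab.data) + len(debug_str.data)

# Shared table: both the symbol and the linkage-name reference use one offset.
shared = StringTable()
symbol_offset = shared.intern(name)
linkage_name_offset = shared.intern(name)

print(separate_size, len(shared.data), symbol_offset == linkage_name_offset)
```

The stripping concern shows up directly in this model: if the shared table is removed, the debug info's offset no longer points at anything.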
David Blaikie via llvm-dev
2021-Jul-02 20:59 UTC
[llvm-dev] DWARF: Reconstituting mangled names (& skipping DW_AT_linkage_name)
On Thu, Jul 1, 2021 at 8:22 PM Reid Kleckner <rnk at google.com> wrote:

> It could work, but the long linkage names will still be present in .strtab, so I wonder if it would make more sense to pursue a solution that addresses both issues. I happen to know you were considering a separate proposal for that, and I wonder if it could be used to solve this problem as well. Either way, the debug info consumer must be taught to look up or reconstitute the long mangled name.

True. (For everyone else's context: I've been tossing around the idea for a while to have an option to use hashed names instead of mangled names for object symbols. Actually, I'm starting to consider maybe generalizing this to an entire floating ABI - if you can guarantee all the C++ is being compiled with the same clang version, it can arbitrarily pick ABI, symbol names, etc., that only have to agree with itself, not with some other version used to compile some precompiled library. We'd still want to preserve the mangled names, maybe heaped together in a compressed section, so that the linker could provide human-actionable diagnostics to the user in the event of linker errors.)

Though I worry that even some way to reference strings in that compressed blob would take up space we could be saving, and the time/space tradeoff might not be worthwhile.

Referencing (rather than reconstituting) would have the advantage that there would be no risk of incorrect reconstitution, which would be nice - but it could be limiting. (For instance, we might at some point want to support links with the symbol names omitted in some modes where linker errors are especially unlikely (continuous integration, etc.), then repeat the link with the symbol names added to get good diagnostics - though I suppose in many cases like that we wouldn't want debug info either... but maybe sometimes, etc.)

> I was thinking something like, "if symbol name is longer than X threshold, replace it with _H${contenthash}, place the long name in a side table section". Tools that are aware of the new convention can do the lookup in the side table. Tools that are unaware will just produce funny names. The DWARF linkage name would use the _H symbol, and consumers that care beyond just having a unique linkage identifier can do the lookup.

Yeah, with DWARF we'd probably make something a bit more explicit - a new DW_FORM, or a new attribute name - though I guess there's some benefit to producing the unique name that everyone can use even if it's not very legible.

Yeah, if I reframe this in my head: what if we fixed the ELF symbol name length problems (by using such a hash scheme) - would the remaining DWARF size cost be worth the complexity of reconstitution and the risk of incorrect reconstitution? Maybe not.

Though perhaps there are folks who might be interested in the reconstitution savings when they can't change their ABI? In that case it'd be pretty misleading to include an incorrect value for the mangled name in the DW_AT_linkage_name attribute. We could introduce a different attribute for it in that case.

(I guess if we used references to this shared "real linkage name section", there wouldn't be an issue with stripped binaries: if you stripped out the linkage name section you probably stripped out the debug info sections too, so there wouldn't be anything left to debug/reference the stripped linkage names.)

Alternatively: if we did this reconstituted linkage name thing, the hashed symbols ELF feature could potentially skip the linkage names when there's debug info present and rely on reconstituting the names...

In summary: I've mixed thoughts on this.

- Dave
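The "mangled names heaped together in a compressed section" idea from the reply above can be sketched roughly as follows. This is an illustrative Python model, not anything proposed in the thread: the function names, the "_H" plus MD5 naming, the NUL-separated layout, and the use of zlib are all assumptions. It shows the blob being consulted only on the diagnostic path, and what happens when the names are omitted from the link.

```python
import hashlib
import zlib
from typing import List, Tuple


def hashed(name: str) -> str:
    """Hypothetical hashed symbol name for a mangled name."""
    return "_H" + hashlib.md5(name.encode("utf-8")).hexdigest()


def build_name_section(names: List[str]) -> Tuple[List[str], bytes]:
    """Return the hashed symbols plus one compressed blob of the real names."""
    blob = zlib.compress("\0".join(names).encode("utf-8"))
    return [hashed(n) for n in names], blob


def describe_undefined(symbol: str, blob: bytes) -> str:
    """Diagnostic path only: decompress the blob and look for the real name."""
    for name in zlib.decompress(blob).decode("utf-8").split("\0"):
        if hashed(name) == symbol:
            return "undefined reference to " + name
    # Names were omitted (or stripped): fall back to the hashed symbol.
    return "undefined reference to " + symbol


names = ["_Z3foov", "_ZN2ns3barEi"]
symbols, blob = build_name_section(names)
print(describe_undefined(symbols[1], blob))

# A link without the name section still works, just with a less legible message.
_, empty_blob = build_name_section([])
print(describe_undefined(symbols[1], empty_blob))
```

Because the lookup scans the decompressed names, the symbols themselves need no per-name reference into the blob, at the cost of a slower error path.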