thr3ads.net - llvm dev - [llvm-dev] Fragmented DWARF [Oct 2020]

James Henderson via llvm-dev

2020-Oct-12 13:41 UTC

[llvm-dev] Fragmented DWARF

Hi all,

At the recent LLVM developers' meeting, I presented a lightning talk on an
approach to reduce the amount of dead debug data left in an executable
following operations such as --gc-sections and duplicate COMDAT removal. In
that talk, I showed some figures based on linking a game that had been
built by our downstream clang port and fragmented using the described
approach. Since recording the presentation, I have run the same experiment
on a clang package (this time built with a version of GCC). The comparable
figures are below:

Link time (s):
+--------------------+-------+---------------+------+------+------+------+------+
| Package variant    | No GC | GC 1 (normal) | GC 2 | GC 3 | GC 4 | GC 5 | GC 6 |
+--------------------+-------+---------------+------+------+------+------+------+
| Game (plain)       |  4.5  |  4.9          |  4.2 |  3.6 |  3.4 |  3.3 |  3.2 |
| Game (fragmented)  | 11.1  | 11.8          |  9.7 |  8.6 |  7.9 |  7.7 |  7.5 |
| Clang (plain)      | 13.9  | 17.9          | 17.0 | 16.7 | 16.3 | 16.2 | 16.1 |
| Clang (fragmented) | 18.6  | 22.8          | 21.6 | 21.1 | 20.8 | 20.5 | 20.2 |
+--------------------+-------+---------------+------+------+------+------+------+

Output size - Game package (MB):
+---------------------+-------+------+------+------+------+------+------+
| Category            | No GC | GC 1 | GC 2 | GC 3 | GC 4 | GC 5 | GC 6 |
+---------------------+-------+------+------+------+------+------+------+
| Plain (total)       | 1149  | 1121 | 1017 |  965 |  938 |  930 |  928 |
| Plain (DWARF*)      |  845  |  845 |  845 |  845 |  845 |  845 |  845 |
| Plain (other)       |  304  |  276 |  172 |  120 |   93 |   85 |   82 |
| Fragmented (total)  | 1044  |  940 |  556 |  373 |  287 |  263 |  255 |
| Fragmented (DWARF*) |  740  |  664 |  384 |  253 |  194 |  178 |  173 |
| Fragmented (other)  |  304  |  276 |  172 |  120 |   93 |   85 |   82 |
+---------------------+-------+------+------+------+------+------+------+

Output size - Clang (MB):
+---------------------+-------+------+------+------+------+------+------+
| Category            | No GC | GC 1 | GC 2 | GC 3 | GC 4 | GC 5 | GC 6 |
+---------------------+-------+------+------+------+------+------+------+
| Plain (total)       | 2596  | 2546 | 2406 | 2332 | 2293 | 2273 | 2251 |
| Plain (DWARF*)      | 1979  | 1979 | 1979 | 1979 | 1979 | 1979 | 1979 |
| Plain (other)       |  616  |  567 |  426 |  353 |  314 |  294 |  272 |
| Fragmented (total)  | 2397  | 2346 | 2164 | 2069 | 2017 | 1990 | 1963 |
| Fragmented (DWARF*) | 1780  | 1780 | 1738 | 1716 | 1703 | 1696 | 1691 |
| Fragmented (other)  |  616  |  567 |  426 |  353 |  314 |  294 |  272 |
+---------------------+-------+------+------+------+------+------+------+

*DWARF size == total size of .debug_info + .debug_line + .debug_ranges +
.debug_aranges + .debug_loc
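To put the output-size tables in relative terms, the short Python snippet below computes how much smaller the fragmented total output is than the plain one in each GC column. It is only an illustrative calculation over the numbers in the tables, not part of D89229; the names are hypothetical.

```python
# Illustrative calculation only (not part of D89229).
# Total output sizes in MB, copied from the tables above.
COLUMNS = ["No GC", "GC 1", "GC 2", "GC 3", "GC 4", "GC 5", "GC 6"]
GAME_PLAIN = [1149, 1121, 1017, 965, 938, 930, 928]
GAME_FRAGMENTED = [1044, 940, 556, 373, 287, 263, 255]
CLANG_PLAIN = [2596, 2546, 2406, 2332, 2293, 2273, 2251]
CLANG_FRAGMENTED = [2397, 2346, 2164, 2069, 2017, 1990, 1963]


def reduction_percent(plain: float, fragmented: float) -> float:
    """Percentage by which the fragmented output is smaller than the plain one."""
    return (plain - fragmented) / plain * 100.0


if __name__ == "__main__":
    for name, plain, frag in [("Game", GAME_PLAIN, GAME_FRAGMENTED),
                              ("Clang", CLANG_PLAIN, CLANG_FRAGMENTED)]:
        cells = ", ".join(f"{col}: {reduction_percent(p, f):.1f}%"
                          for col, p, f in zip(COLUMNS, plain, frag))
        print(f"{name}: {cells}")
```

For the game, the saving grows from roughly 16% at GC 1 to roughly 72% at GC 6, whereas for Clang it stays between roughly 8% and 13%.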

Additionally, I have posted https://reviews.llvm.org/D89229, which provides
the Python script and linker patches used to reproduce the above results on
my machine. The GC 1/2/3/4/5/6 columns correspond to values of
1/0.8/0.6/0.4/0.2/0, respectively, for the --mark-live-pc linker option
added in that patch.

During the conference, someone asked what the impact on memory usage and
input size was. I've summarised these below:

Input file size total (GB):
+--------------------+------------+
| Package variant    | Total Size |
+--------------------+------------+
| Game (plain)       |     2.9    |
| Game (fragmented)  |     4.2    |
| Clang (plain)      |    10.9    |
| Clang (fragmented) |    12.3    |
+--------------------+------------+

Peak Working Set Memory usage (GB):
+--------------------+-------+------+
| Package variant    | No GC | GC 1 |
+--------------------+-------+------+
| Game (plain)       |  4.3  |  4.7 |
| Game (fragmented)  |  8.9  |  8.6 |
| Clang (plain)      | 15.7  | 15.6 |
| Clang (fragmented) | 19.4  | 19.2 |
+--------------------+-------+------+

I'm keen to hear people's feedback, and I'm also interested to see what
results others get by running this experiment on other input packages.
Also, if anybody has any alternative ideas that meet the goals listed
below, I'd love to hear them!

To reiterate some key goals of fragmented DWARF, similar to what I said in
the presentation:
1) Devise a scheme that gives significant size savings without being too
costly. It's clear from just the two packages I've tried this on that there
is a fairly hefty link-time performance cost, although the exact cost
depends on the nature of the input package. On the other hand, depending on
the nature of the input package, there can also be some big size gains.
2) Devise a scheme that doesn't require the linker to have any knowledge of
DWARF. The current approach doesn't quite achieve this, due to the slight
misuse of SHF_LINK_ORDER, but I expect that a pivot to using non-COMDAT
group sections should solve this problem.
3) Provide some kind of halfway house between simply writing tombstone
values into dead DWARF and fully parsing the DWARF to reoptimise it or
discard the dead bits.
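To make the cost side of goal 1 concrete, this illustrative Python snippet (not part of D89229; names are hypothetical) computes how many times larger the fragmented figures are than the plain ones, using the link-time (GC 1), peak-memory (GC 1) and input-size numbers from the tables above.

```python
# Illustrative calculation only (not part of D89229).
# (plain, fragmented) pairs copied from the tables above.
MEASUREMENTS = {
    "Game": {
        "link time, GC 1 (s)": (4.9, 11.8),
        "peak memory, GC 1 (GB)": (4.7, 8.6),
        "input size (GB)": (2.9, 4.2),
    },
    "Clang": {
        "link time, GC 1 (s)": (17.9, 22.8),
        "peak memory, GC 1 (GB)": (15.6, 19.2),
        "input size (GB)": (10.9, 12.3),
    },
}


def overhead_ratio(plain: float, fragmented: float) -> float:
    """How many times larger the fragmented figure is than the plain one."""
    return fragmented / plain


if __name__ == "__main__":
    for package, metrics in MEASUREMENTS.items():
        for metric, (plain, frag) in metrics.items():
            print(f"{package} {metric}: x{overhead_ratio(plain, frag):.2f}")
```

With normal GC, fragmenting makes the game link roughly 2.4 times slower but the Clang link only about 1.3 times slower, which illustrates how strongly the cost depends on the input package.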

I'm hopeful that changes could be made to the linker to reduce the
link-time cost. A significant amount of the link time seems to be spent
creating the input sections. An alternative would be to devise a scheme
that avoids literally splitting the DWARF into separate section headers, in
favour of some sort of list of split points that the linker uses to split
things up (a bit like it already does for .eh_frame or mergeable sections).
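The split-point alternative can be modelled roughly as follows. This Python sketch assumes a hypothetical encoding in which each input section carries a sorted list of split offsets plus, for each resulting piece, the name of the section that piece describes (standing in for the association SHF_LINK_ORDER provides in the current prototype). It is not an existing linker feature, and it deliberately ignores relocations and references between pieces, which a real implementation would have to fix up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Piece:
    start: int                 # offset of the piece within the input section
    end: int                   # exclusive end offset
    associated: Optional[str]  # section this piece describes; None = always keep


def split_section(size: int, split_points: list[int],
                  associations: list[Optional[str]]) -> list[Piece]:
    """Cut [0, size) at each split point (hypothetical encoding, see above)."""
    bounds = [0] + sorted(split_points) + [size]
    if len(associations) != len(bounds) - 1:
        raise ValueError("need exactly one association per piece")
    return [Piece(s, e, a) for s, e, a in zip(bounds, bounds[1:], associations)]


def keep_live_pieces(data: bytes, pieces: list[Piece], live: set[str]) -> bytes:
    """Concatenate the pieces whose associated section survived GC.

    Relocation and cross-piece offset fixups are intentionally omitted.
    """
    return b"".join(data[p.start:p.end] for p in pieces
                    if p.associated is None or p.associated in live)


if __name__ == "__main__":
    data = b"HDRAAAABBBBCCCC"
    pieces = split_section(len(data), [3, 7, 11],
                           [None, ".text.a", ".text.b", ".text.c"])
    print(keep_live_pieces(data, pieces, {".text.a", ".text.c"}))
```

The appeal of this shape is that the object file keeps one section header per DWARF section, so the linker only creates lightweight piece records rather than full input sections.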

David Blaikie via llvm-dev

2020-Oct-12 19:48 UTC

Awesome! Sorry I missed the lightning talk, but I'm really interested to see
this sort of thing (though it's not directly/immediately applicable to the
use case I work with, Split DWARF, something similar could be used there
with further work)

Though it looks like the patch has mostly linker changes - where/how do you
generate the fragmented DWARF to begin with? Via the Python script? Run
over assembly? I'd be surprised if it was achievable that way - curious to
know more.

Got a rough sense/are you able to run apples-to-apples comparisons with
Alexey's linker-based patches to compare linker time/memory overhead versus
resulting output size gains?

(& yeah, I'm a bit curious about how the linkers do eh_frame rewriting, if
the format is especially amenable to a lightweight parsing/rewriting and
how we could make the DWARF more amenable to that too)

James Henderson via llvm-dev

2020-Oct-13 07:20 UTC

The script included in the patch can be used to convert an object
containing normal DWARF into an object using fragmented DWARF. It does this
by using llvm-dwarfdump to dump the various sections, parsing the output to
identify where it should split (using the offsets of the various entries),
and then writing new section headers accordingly; you can see roughly what
it's doing if you get a chance to watch the talk recording. The additional
section headers are appended to the end of the ELF section header table,
whilst the original DWARF is left in the same place it was before (making
use of the fact that section headers don't have to appear in offset order).
The script also parses and fragments the relocation sections targeting the
DWARF sections so that they match up with the fragmented DWARF sections.
This is clearly all suboptimal: in practice, the compiler should be
modified to do the fragmenting up front, to avoid having to parse a tool's
stdout, but that was just the simplest thing I could come up with to
quickly write the script. Full details of the script usage are included in
the patch description, if you want to play around with it.
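The bookkeeping described above can be sketched in a few lines of Python. This is a minimal sketch, not the code from D89229: it assumes the split points (for example the offsets of each unit reported by llvm-dwarfdump) and the section's relocations have already been extracted, and the names Fragment and fragment_section are hypothetical. Each fragment simply points back into the original bytes, mirroring how the script leaves the DWARF in place and only adds section headers, and each relocation is rebased to be relative to the fragment that contains it. Actually emitting ELF headers is out of scope here.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Fragment:
    file_offset: int   # would become sh_offset of the new section header
    size: int          # would become sh_size of the new section header
    relocs: list       # (offset within fragment, symbol) pairs


def fragment_section(sec_offset: int, sec_size: int, split_points: list[int],
                     relocs: list[tuple[int, str]]) -> list[Fragment]:
    """Describe fragments of one section without moving its bytes (hypothetical helper).

    split_points are section-relative offsets (e.g. the start of each DWARF
    unit); relocs are (section-relative r_offset, symbol) pairs.
    """
    starts = sorted({0, *split_points})
    ends = starts[1:] + [sec_size]
    frags = [Fragment(sec_offset + s, e - s, []) for s, e in zip(starts, ends)]
    for r_offset, sym in relocs:
        # Index of the last fragment starting at or before r_offset.
        i = bisect.bisect_right(starts, r_offset) - 1
        frags[i].relocs.append((r_offset - starts[i], sym))
    return frags


if __name__ == "__main__":
    for f in fragment_section(0x400, 0x30, [0x10, 0x20],
                              [(0x4, "a"), (0x18, "b"), (0x28, "c")]):
        print(hex(f.file_offset), hex(f.size), f.relocs)
```

Because every fragment reuses the original file bytes, only the section header table and relocation sections grow, which is consistent with the input-size increase reported earlier in the thread.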

If Alexey could point me at the latest version of his patch, I'd be happy
to run it on either or both of the packages I used to see what happens.
Equally, I'd be happy if Alexey is able to run my script to fragment and
measure the performance of a couple of projects he's been working with.
Based purely on the two packages I've tried this with, I can already tell
that the results can vary wildly. My expectation is that Alexey's approach
will be slower (at least in its current form, and probably more generally)
but will produce smaller output; to what extent, I have no idea.

I think linkers parse .eh_frame partly because they have no other choice.
That being said, I think its format is not too complex, so the parser isn't
too complex either. You can see LLD's ELF implementation in
ELF/EhFrame.cpp, how it is used in ELF/InputSection.cpp (see the bits to do
with EhInputSection) and EhFrameSection in ELF/SyntheticSections.h (plus
various usages of these two throughout the LLD code). I think the key to
any structural changes to the DWARF format that make it more amenable to
link-time parsing is being able to read a minimal amount without needing to
parse the payload (e.g. a length field, some sort of type, and then using
the relocations to associate it accordingly).
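To illustrate that kind of minimal reading, here is a small Python sketch that walks a .eh_frame section and classifies each record as a CIE or an FDE using only the length field and the 4-byte CIE ID/CIE pointer that follows it, never decoding the payload. It is a simplified illustration and not how LLD implements this: it assumes a well-formed section, stops at a zero-length terminator, and does not apply relocations.

```python
import struct


def eh_frame_records(data: bytes, little_endian: bool = True):
    """Yield (offset, total_size, kind) for each record in a .eh_frame section."""
    end = "<" if little_endian else ">"
    pos = 0
    while pos + 4 <= len(data):
        (length,) = struct.unpack_from(end + "I", data, pos)
        if length == 0:            # zero-length terminator
            break
        header = 4
        if length == 0xFFFFFFFF:   # 64-bit extended length follows
            (length,) = struct.unpack_from(end + "Q", data, pos + 4)
            header = 12
        # The 4-byte field after the length is 0 for a CIE and a CIE pointer for an FDE.
        (cie_id,) = struct.unpack_from(end + "I", data, pos + header)
        kind = "CIE" if cie_id == 0 else "FDE"
        total = header + length    # the length excludes the length field itself
        yield pos, total, kind
        pos += total
```

A linker-style consumer would then use the relocation against each FDE's initial location to associate the record with its function's section, and drop it if that section is dead.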

James
