Duncan P. N. Exon Smith
2014-Oct-14 18:40 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
> On Oct 13, 2014, at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
> Stupid question, but when I was working on LTO last Summer the primary
> culprit for excessive memory use was due to us not being smart when linking
> the IR together (Espindola would know more details). Do we still have that
> problem? For starters, how does the memory usage of just llvm-link compare
> to the memory usage of the actual LTO run? If the issue I was seeing last
> Summer is still there, you should see that the invocation of llvm-link is
> actually the most memory-intensive part of the LTO step, by far.

To be clear, I'm running the command-line:

    $ llvm-lto -exported-symbol _main llvm-lto.lto.bc

Since this is a pre-linked bitcode file, we shouldn't be wasting much
memory from the linking stage.

Running ld64 directly gives a peak memory footprint of ~30GB for the
full link, so there's something else going on there that I'll be
digging into later.

> 2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we have
> a single pie slice near 40% of the # of Value's allocated and another at
> 21%. Especially this being "step 4".

15.3GB is the peak memory of `llvm-lto`. This comes late in the
process, after DIEs have been created. I haven't looked in detail past
debug info metadata, but here's a sketch of what I imagine is in memory
at this point.

- The IR, including uniquing side-tables.
- Optimization and backend passes.
- Parts of SelectionDAG that haven't been freed.
- `MachineFunction`s and everything inside them.
- Whatever state the `AsmPrinter`, etc., need.

I expect to look at a couple of other debug-info-related memory usage
areas once I've shrunk the metadata:

- What's the total footprint of DIEs? This run has 4M of them, whose
  allocated footprint is ~1GB. I'm hoping that a deeper look will
  reveal an even larger attack surface.

- How much do debug info intrinsics cost? They show up in at least
  three forms -- IR-level, SDNodes, and MachineInstrs -- and there can
  be a lot of them. How many? What's their footprint?

For now, I'm focusing on the problem I've already identified.

> You need more data. Right now you have essentially one data point,

I looked at a number of internal C and C++ programs with -flto -g, and
dug deeply into llvm-lto.lto.bc because it's small enough that it's easy
to analyze (and its runtime profile was representative of the other C++
programs I was looking at).

I didn't look deeply at a broad spectrum, but memory usage and runtime
for building clang with -flto -g is something we care a fair bit about.

> and it's not even clear what you measured really. If your goal is saving
> memory, I would expect at least a pie chart that breaks down LLVM's memory
> usage (not just # of allocations of different sorts; an approximation is
> fine, as long as you explain how you arrived at it and in what sense it
> approximates the true number).

I'm not sure there's value in diving deeply into everything at once.
I've identified one of the bottlenecks, so I'd like to improve it before
digging into the others.

Here's some visibility into where my numbers come from.

I got the 15.3GB from a profile of memory usage vs. time. Peak usage
comes late in the process, around when DIEs are being dealt with.

Metadata node counts stabilize much earlier in the process. The rest of
the numbers are based on counting `MDNodes` and their respective
`MDNodeOperands`, and multiplying by the cost of their operands. Here's
a dump from around the peak metadata node count:

    LineTables = 7500000[30000000], InlinedLineTables = 6756182,
    Directives = 7611669[42389128], Arrays = 570609[577447], Others = 1176556[5133065]
    Tag = 256, Count = 554992, Ops = 2531428, Name = DW_TAG_auto_variable
    Tag = 16647, Count = 988, Ops = 4940, Name = DW_TAG_GNU_template_parameter_pack
    Tag = 52, Count = 9933, Ops = 59598, Name = DW_TAG_variable
    Tag = 33, Count = 190, Ops = 190, Name = DW_TAG_subrange_type
    Tag = 59, Count = 1, Ops = 3, Name = DW_TAG_unspecified_type
    Tag = 40, Count = 24731, Ops = 24731, Name = DW_TAG_enumerator
    Tag = 21, Count = 354166, Ops = 2833328, Name = DW_TAG_subroutine_type
    Tag = 2, Count = 77999, Ops = 623992, Name = DW_TAG_class_type
    Tag = 47, Count = 27122, Ops = 108488, Name = DW_TAG_template_type_parameter
    Tag = 28, Count = 8491, Ops = 33964, Name = DW_TAG_inheritance
    Tag = 66, Count = 10930, Ops = 43720, Name = DW_TAG_rvalue_reference_type
    Tag = 16, Count = 54680, Ops = 218720, Name = DW_TAG_reference_type
    Tag = 23, Count = 624, Ops = 4992, Name = DW_TAG_union_type
    Tag = 4, Count = 5344, Ops = 42752, Name = DW_TAG_enumeration_type
    Tag = 11, Count = 360390, Ops = 1081170, Name = DW_TAG_lexical_block
    Tag = 258, Count = 1, Ops = 1, Name = DW_TAG_expression
    Tag = 13, Count = 73880, Ops = 299110, Name = DW_TAG_member
    Tag = 58, Count = 1387, Ops = 4161, Name = DW_TAG_imported_module
    Tag = 1, Count = 2747, Ops = 21976, Name = DW_TAG_array_type
    Tag = 46, Count = 1341021, Ops = 12069189, Name = DW_TAG_subprogram
    Tag = 257, Count = 4373879, Ops = 20785065, Name = DW_TAG_arg_variable
    Tag = 8, Count = 2246, Ops = 6738, Name = DW_TAG_imported_declaration
    Tag = 53, Count = 57, Ops = 228, Name = DW_TAG_volatile_type
    Tag = 15, Count = 55163, Ops = 220652, Name = DW_TAG_pointer_type
    Tag = 41, Count = 3382, Ops = 6764, Name = DW_TAG_file_type
    Tag = 22, Count = 158479, Ops = 633916, Name = DW_TAG_typedef
    Tag = 48, Count = 486, Ops = 2430, Name = DW_TAG_template_value_parameter
    Tag = 36, Count = 15, Ops = 45, Name = DW_TAG_base_type
    Tag = 17, Count = 1164, Ops = 8148, Name = DW_TAG_compile_unit
    Tag = 31, Count = 19, Ops = 95, Name = DW_TAG_ptr_to_member_type
    Tag = 57, Count = 2034, Ops = 6102, Name = DW_TAG_namespace
    Tag = 38, Count = 32133, Ops = 128532, Name = DW_TAG_const_type
    Tag = 19, Count = 72995, Ops = 583960, Name = DW_TAG_structure_type

(Note: the InlinedLineTables stat is included in the LineTables stat.)

You can determine the rough memory footprint of each type of node by
multiplying the "Count" by `sizeof(MDNode)` (x86-64: 56B) and the "Ops"
by `sizeof(MDNodeOperand)` (x86-64: 32B).

Overall, there are 7.5M line tables with 30M operands, so by this method
their footprint is ~1.3GB. There are 7.6M descriptors with 42.4M
operands, so their footprint is ~1.7GB.

I dumped another stat periodically to tell me the peak size of the
side-tables for line table entries, which are split into "Scopes" (for
non-inlined) and "Inlined" (these counts are disjoint, unlike the
previous stats):

    Scopes = 203166 [203166], Inlined = 3500000 [3500000]

I assumed that both `DenseMap` and `std::vector` over-allocate by 50%
to estimate the current (and planned) costs for the side-tables.

Another stat I dumped periodically was the breakdown between V(alues),
U(sers), C(onstants), M(etadata nodes), and (metadata) S(trings).
Here's a sample from nearby:

    V = 23967800 (40200000 - 16232200)
    U =  5850877 ( 7365503 -  1514626)
    C =   205491 (  279134 -    73643)
    M = 16837368 (31009291 - 14171923)
    S =   693869 (  693869 -        0)

Lastly, I dumped a breakdown of the types of MDNodeOperands. This is
also a sample from nearby:

    MDOps = 77644750 (100%)
    Const = 14947077 ( 19%)
    Node  = 41749475 ( 53%)
    Str   =  9553581 ( 12%)
    Null  = 10976693 ( 14%)
    Other =   417924 (  0%)

While I didn't use this breakdown for my memory estimates, it was
interesting nevertheless. Note the following:

- The number of constants is just under 15M. This dump came less than
  a second before the dump above, where we have 7.5M line table
  entries. Line table entries have 2 operands of `ConstantInt`. This
  lines up nicely.

  Note: this checked `isa<Constant>(Op) && !isa<GlobalValue>(Op)`.

- There are a lot of null operands. By making subclasses for the
  various types of debug info IR, we can probably shed some of these
  altogether.

- There are few "Other" operands. These are likely all `GlobalValue`
  references, and are the only operands that need to be referenced
  using value handles.
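The footprint arithmetic described in the email above can be reproduced directly from the dump. Here is an illustrative Python sketch (not part of the original mail) that applies the stated x86-64 sizes, 56B per `MDNode` and 32B per `MDNodeOperand`, to the dumped counts, and cross-checks the `ConstantInt` observation:

```python
# Rough footprint arithmetic from the stats dump above (illustrative
# sketch; the per-object sizes are the x86-64 figures quoted in the mail).
SIZEOF_MDNODE = 56          # bytes per MDNode on x86-64
SIZEOF_MDNODE_OPERAND = 32  # bytes per MDNodeOperand on x86-64

def footprint_gb(node_count, operand_count):
    """Estimated footprint of a class of nodes plus their operands."""
    total = node_count * SIZEOF_MDNODE + operand_count * SIZEOF_MDNODE_OPERAND
    return total / (1024 ** 3)

# LineTables = 7500000[30000000]  -> ~1.3GB
line_tables = footprint_gb(7_500_000, 30_000_000)

# Directives = 7611669[42389128]  -> ~1.7GB
directives = footprint_gb(7_611_669, 42_389_128)

print(f"line tables: {line_tables:.1f}GB")  # → line tables: 1.3GB
print(f"directives:  {directives:.1f}GB")   # → directives:  1.7GB

# Cross-check on the operand breakdown: 7.5M line-table entries with
# 2 ConstantInt operands each accounts for most of the ~15M Const operands.
print(7_500_000 * 2)  # → 15000000
```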
Sean Silva
2014-Oct-17 05:09 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
On Tue, Oct 14, 2014 at 11:40 AM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:

> To be clear, I'm running the command-line:
>
>     $ llvm-lto -exported-symbol _main llvm-lto.lto.bc
>
> Since this is a pre-linked bitcode file, we shouldn't be wasting much
> memory from the linking stage.
>
> Running ld64 directly gives a peak memory footprint of ~30GB for the
> full link, so there's something else going on there that I'll be
> digging into later.

Dig into this first! In the OP you are talking about essentially a pure
"optimization" (in the programmer-wisdom "beware of it" sense), to
"save" 2GB of peak memory. But from your analysis it's not clear that
this 2GB savings is actually reflected as a peak memory usage saving
(since the ~30GB peak might be happening elsewhere in the LTO process).
It is this ~30GB peak, and not the one you originally analyzed, which
your customers presumably care about. Be realistic: ~15GB (equivalent to
*all the memory you tracked combined*) is flying under the radar of your
analysis. It would be very surprising if this has not led to multiple
erroneous conclusions about what direction to take things in your plan.
To recap, the analysis you performed seems to support neither of the
following conclusions:

- Peak memory usage during LTO would be improved by this plan.
- Build time for LTO would be improved by this plan (from what you have
  posted, you didn't measure time at all).

I hesitate to consider such an "aggressive plan" (as you describe it)
with neither of the above two legs to stand on.

Of course, this is all tangential to the discussion of e.g. a more
readable/writable .ll form for debug info, or debug info compatibility.
However, it seems like you jumped into this from the point of view of it
being an optimization, rather than a maintainability/compatibility
thing.

-- Sean Silva
Duncan P. N. Exon Smith
2014-Oct-17 15:47 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
> On 2014 Oct 16, at 22:09, Sean Silva <chisophugis at gmail.com> wrote:
>
> Dig into this first!

This isn't the right forum for digging into ld64.

> In the OP you are talking about essentially a pure "optimization" (in
> the programmer-wisdom "beware of it" sense), to "save" 2GB of peak
> memory. But from your analysis it's not clear that this 2GB savings
> actually is reflected as peak memory usage saving

It's reflected in both links.

> (since the ~30GB peak might be happening elsewhere in the LTO
> process). It is this ~30GB peak, and not the one you originally
> analyzed, which your customers presumably care about.

This discussion is intentionally focused on llvm-lto.

> To recap, the analysis you performed seems to support neither of the
> following conclusions:
> - Peak memory usage during LTO would be improved by this plan

The analysis is based on the nodes allocated at peak memory.

> - Build time for LTO would be improved by this plan (from what you
>   have posted, you didn't measure time at all)

CPU profiles blame 25-35% of the CPU of the ld64 LTO link on
callback-based metadata RAUW traffic, depending on the C++ program.

> Of course, this is all tangential to the discussion of e.g. a more
> readable/writable .ll form for debug info, or debug info
> compatibility. However, it seems like you jumped into this from the
> point of view of it being an optimization, rather than a
> maintainability/compatibility thing.

It's both.
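The RAUW (replaceAllUsesWith) cost mentioned above comes from every metadata operand tracking its pointee and being individually notified on replacement. Here is a deliberately simplified, self-contained Python sketch of that callback-based scheme -- not LLVM's actual C++ `ValueHandle` implementation, just an illustration of why a replacement must walk and update every use:

```python
# Simplified model of callback-based RAUW (illustrative only; LLVM's
# real mechanism is C++ value handles, not this Python).

class Value:
    def __init__(self, name):
        self.name = name
        self.uses = []  # operands currently pointing at this value

    def replace_all_uses_with(self, new_value):
        # Every use is visited and re-pointed individually; with tens
        # of millions of metadata operands, this per-use traffic is
        # what shows up in CPU profiles.
        for operand in list(self.uses):
            operand.set(new_value)

class Operand:
    def __init__(self, value):
        self.value = None
        self.set(value)

    def set(self, value):
        # Unregister from the old value's use-list, register with the new.
        if self.value is not None:
            self.value.uses.remove(self)
        self.value = value
        value.uses.append(self)

old, new = Value("old"), Value("new")
ops = [Operand(old) for _ in range(3)]
old.replace_all_uses_with(new)
print(all(op.value is new for op in ops))  # → True
```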