Duncan P. N. Exon Smith
2014-Oct-14 18:40 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
> On Oct 13, 2014, at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote: > > Stupid question, but when I was working on LTO last Summer the primary culprit for excessive memory use was due to us not being smart when linking the IR together (Espindola would know more details). Do we still have that problem? For starters, how does the memory usage of just llvm-link compare to the memory usage of the actual LTO run? If the issue I was seeing last Summer is still there, you should see that the invocation of llvm-link is actually the most memory-intensive part of the LTO step, by far.To be clear, I'm running the command-line: $ llvm-lto -exported-symbol _main llvm-lto.lto.bc Since this is a pre-linked bitcode file, we shouldn't be wasting much memory from the linking stage. Running ld64 directly gives a peak memory footprint of ~30GB for the full link, so there's something else going on there that I'll be digging into later.> 2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we have a single pie slice near 40% of the # of Value's allocated and another at 21%. Especially this being "step 4".15.3GB is the peak memory of `llvm-lto`. This comes late in the process, after DIEs have been created. I haven't looked in detail past debug info metadata, but here's a sketch of what I imagine is in memory at this point. - The IR, including uniquing side-tables. - Optimization and backend passes. - Parts of SelectionDAG that haven't been freed. - `MachineFunction`s and everything inside them. - Whatever state the `AsmPrinter`, etc., need. I expect to look at a couple of other debug-info-related memory usage areas once I've shrunk the metadata: - What's the total footprint of DIEs? This run has 4M of them, whose allocated footprint is ~1GB. I'm hoping that a deeper look will reveal an even larger attack surface. - How much do debug info intrinsics cost? They show up in at least three forms -- IR-level, SDNodes, and MachineInstrs -- and there can be a lot of them. How many? What's their footprint? For now, I'm focusing on the problem I've already identified.> You need more data. Right now you have essentially one data point,I looked at a number of internal C and C++ programs with -flto -g, and dug deeply into llvm-lto.lto.bc because it's small enough that it's easy to analyze (and its runtime profile was representative of the other C++ programs I was looking at). I didn't look deeply at a broad spectrum, but memory usage and runtime for building clang with -flto -g is something we care a fair bit about.> and it's not even clear what you measured really. If your goal is saving memory, I would expect at least a pie chart that breaks down LLVM's memory usage (not just # of allocations of different sorts; an approximation is fine, as long as you explain how you arrived at it and in what sense it approximates the true number).I'm not sure there's value in diving deeply into everything at once. I've identified one of the bottlenecks, so I'd like to improve it before digging into the others. Here's some visibility into where my numbers come from. I got the 15.3GB from a profile of memory usage vs. time. Peak usage comes late in the process, around when DIEs are being dealt with. Metadata node counts stabilize much earlier in the process. The rest of the numbers are based on counting `MDNodes` and their respective `MDNodeOperands`, and multiplying by the cost of their operands. Here's a dump from around the peak metadata node count: LineTables = 7500000[30000000], InlinedLineTables = 6756182, Directives = 7611669[42389128], Arrays = 570609[577447], Others = 1176556[5133065] Tag = 256, Count = 554992, Ops = 2531428, Name = DW_TAG_auto_variable Tag = 16647, Count = 988, Ops = 4940, Name = DW_TAG_GNU_template_parameter_pack Tag = 52, Count = 9933, Ops = 59598, Name = DW_TAG_variable Tag = 33, Count = 190, Ops = 190, Name = DW_TAG_subrange_type Tag = 59, Count = 1, Ops = 3, Name = DW_TAG_unspecified_type Tag = 40, Count = 24731, Ops = 24731, Name = DW_TAG_enumerator Tag = 21, Count = 354166, Ops = 2833328, Name = DW_TAG_subroutine_type Tag = 2, Count = 77999, Ops = 623992, Name = DW_TAG_class_type Tag = 47, Count = 27122, Ops = 108488, Name = DW_TAG_template_type_parameter Tag = 28, Count = 8491, Ops = 33964, Name = DW_TAG_inheritance Tag = 66, Count = 10930, Ops = 43720, Name = DW_TAG_rvalue_reference_type Tag = 16, Count = 54680, Ops = 218720, Name = DW_TAG_reference_type Tag = 23, Count = 624, Ops = 4992, Name = DW_TAG_union_type Tag = 4, Count = 5344, Ops = 42752, Name = DW_TAG_enumeration_type Tag = 11, Count = 360390, Ops = 1081170, Name = DW_TAG_lexical_block Tag = 258, Count = 1, Ops = 1, Name = DW_TAG_expression Tag = 13, Count = 73880, Ops = 299110, Name = DW_TAG_member Tag = 58, Count = 1387, Ops = 4161, Name = DW_TAG_imported_module Tag = 1, Count = 2747, Ops = 21976, Name = DW_TAG_array_type Tag = 46, Count = 1341021, Ops = 12069189, Name = DW_TAG_subprogram Tag = 257, Count = 4373879, Ops = 20785065, Name = DW_TAG_arg_variable Tag = 8, Count = 2246, Ops = 6738, Name = DW_TAG_imported_declaration Tag = 53, Count = 57, Ops = 228, Name = DW_TAG_volatile_type Tag = 15, Count = 55163, Ops = 220652, Name = DW_TAG_pointer_type Tag = 41, Count = 3382, Ops = 6764, Name = DW_TAG_file_type Tag = 22, Count = 158479, Ops = 633916, Name = DW_TAG_typedef Tag = 48, Count = 486, Ops = 2430, Name = DW_TAG_template_value_parameter Tag = 36, Count = 15, Ops = 45, Name = DW_TAG_base_type Tag = 17, Count = 1164, Ops = 8148, Name = DW_TAG_compile_unit Tag = 31, Count = 19, Ops = 95, Name = DW_TAG_ptr_to_member_type Tag = 57, Count = 2034, Ops = 6102, Name = DW_TAG_namespace Tag = 38, Count = 32133, Ops = 128532, Name = DW_TAG_const_type Tag = 19, Count = 72995, Ops = 583960, Name = DW_TAG_structure_type (Note: the InlinedLineTables stat is included in LineTables stat.) You can determine the rough memory footprint of each type of node by multiplying the "Count" by `sizeof(MDNode)` (x86-64: 56B) and the "Ops" by `sizeof(MDNodeOperand)` (x86-64: 32B). Overall, there are 7.5M linetables with 30M operands, so by this method their footprint is ~1.3GB. There are 7.6M descriptors with 42.4M operands, so their footprint is ~1.7GB. I dumped another stat periodically to tell me the peak size of the side-tables for line table entries, which are split into "Scopes" (for non-inlined) and "Inlined" (these counts are disjoint, unlike the previous stats): Scopes = 203166 [203166], Inlined = 3500000 [3500000] I assumed that both `DenseMap` and `std::vector` over-allocate by 50% to estimate the current (and planned) costs for the side-tables. Another stat I dumped periodically was the breakdown between V(alues), U(sers), C(onstants), M(etadata nodes), and (metadata) S(trings). Here's a sample from nearby: V = 23967800 (40200000 - 16232200) U = 5850877 ( 7365503 - 1514626) C = 205491 ( 279134 - 73643) M = 16837368 (31009291 - 14171923) S = 693869 ( 693869 - 0) Lastly, I dumped a breakdown of the types of MDNodeOperands. This is also a sample from nearby: MDOps = 77644750 (100%) Const = 14947077 ( 19%) Node = 41749475 ( 53%) Str = 9553581 ( 12%) Null = 10976693 ( 14%) Other = 417924 ( 0%) While I didn't use this breakdown for my memory estimates, it was interesting nevertheless. Note the following: - The number of constants is just under 15M. This dump came less than a second before the dump above, where we have 7.5M line table entries. Line table entries have 2 operands of `ConstantInt`. This lines up nicely. Note: this checked `isa<Constant>(Op) && !isa<GlobalValue>(Op)`. - There are a lot of null operands. By making subclasses for the various types of debug info IR, we can probably shed some of these altogether. - There are few "Other" operands. These are likely all `GlobalValue` references, and are the only operands that need to be referenced using value handles.
Sean Silva
2014-Oct-17 05:09 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
On Tue, Oct 14, 2014 at 11:40 AM, Duncan P. N. Exon Smith < dexonsmith at apple.com> wrote:> > > On Oct 13, 2014, at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote: > > > > Stupid question, but when I was working on LTO last Summer the primary > culprit for excessive memory use was due to us not being smart when linking > the IR together (Espindola would know more details). Do we still have that > problem? For starters, how does the memory usage of just llvm-link compare > to the memory usage of the actual LTO run? If the issue I was seeing last > Summer is still there, you should see that the invocation of llvm-link is > actually the most memory-intensive part of the LTO step, by far. > > To be clear, I'm running the command-line: > > $ llvm-lto -exported-symbol _main llvm-lto.lto.bc > > Since this is a pre-linked bitcode file, we shouldn't be wasting much > memory from the linking stage. > > Running ld64 directly gives a peak memory footprint of ~30GB for the > full link, so there's something else going on there that I'll be > digging into later. >Dig into this first! In the OP you are talking about essentially a pure "optimization" (in the programmer-wisdom "beware of it" sense), to "save" 2GB of peak memory. But from your analysis it's not clear that this 2GB savings actually is reflected as peak memory usage saving (since the ~30GB peak might be happening elsewhere in the LTO process). It is this ~30GB peak, and not the one you originally analyzed, which your customers presumably care about. Be realistic: ~15GB (equivalent to *all the memory you tracked combined*) is flying under the radar of your analysis. It would be very surprising if this has not led to multiple erroneous conclusions about what direction to take things in your plan. To recap, the analysis you performed seems to support neither of the following conclusions: - Peak memory usage during LTO would be improved by this plan - Build time for LTO would be improved by this plan (from what you have posted, you didn't measure time at all) I hesitate to consider such an "aggressive plan" (as you describe it) with neither of the above two legs to stand on. Of course, this is all tangential to the discussion of e.g. a more readable/writable .ll form for debug info, or debug info compatibility. However, it seems like you jumped into this from the point of view of it being an optimization, rather than a maintainability/compatibility thing. -- Sean Silva> > > 2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we have > a single pie slice near 40% of the # of Value's allocated and another at > 21%. Especially this being "step 4". > > 15.3GB is the peak memory of `llvm-lto`. This comes late in the > process, after DIEs have been created. I haven't looked in detail past > debug info metadata, but here's a sketch of what I imagine is in memory > at this point. > > - The IR, including uniquing side-tables. > - Optimization and backend passes. > - Parts of SelectionDAG that haven't been freed. > - `MachineFunction`s and everything inside them. > - Whatever state the `AsmPrinter`, etc., need. > > I expect to look at a couple of other debug-info-related memory usage > areas once I've shrunk the metadata: > > - What's the total footprint of DIEs? This run has 4M of them, whose > allocated footprint is ~1GB. I'm hoping that a deeper look will > reveal an even larger attack surface. > > - How much do debug info intrinsics cost? They show up in at least > three forms -- IR-level, SDNodes, and MachineInstrs -- and there > can be a lot of them. How many? What's their footprint? > > For now, I'm focusing on the problem I've already identified. > > > You need more data. Right now you have essentially one data point, > > I looked at a number of internal C and C++ programs with -flto -g, and > dug deeply into llvm-lto.lto.bc because it's small enough that it's easy > to analyze (and its runtime profile was representative of the other C++ > programs I was looking at). > > I didn't look deeply at a broad spectrum, but memory usage and runtime > for building clang with -flto -g is something we care a fair bit about. > > > and it's not even clear what you measured really. If your goal is saving > memory, I would expect at least a pie chart that breaks down LLVM's memory > usage (not just # of allocations of different sorts; an approximation is > fine, as long as you explain how you arrived at it and in what sense it > approximates the true number). > > I'm not sure there's value in diving deeply into everything at once. > I've identified one of the bottlenecks, so I'd like to improve it before > digging into the others. > > Here's some visibility into where my numbers come from. > > I got the 15.3GB from a profile of memory usage vs. time. Peak usage > comes late in the process, around when DIEs are being dealt with. > > Metadata node counts stabilize much earlier in the process. The rest of > the numbers are based on counting `MDNodes` and their respective > `MDNodeOperands`, and multiplying by the cost of their operands. Here's > a dump from around the peak metadata node count: > > LineTables = 7500000[30000000], InlinedLineTables = 6756182, > Directives = 7611669[42389128], Arrays = 570609[577447], Others > 1176556[5133065] > Tag = 256, Count = 554992, Ops = 2531428, Name > DW_TAG_auto_variable > Tag = 16647, Count = 988, Ops = 4940, Name > DW_TAG_GNU_template_parameter_pack > Tag = 52, Count = 9933, Ops = 59598, Name = DW_TAG_variable > Tag = 33, Count = 190, Ops = 190, Name > DW_TAG_subrange_type > Tag = 59, Count = 1, Ops = 3, Name > DW_TAG_unspecified_type > Tag = 40, Count = 24731, Ops = 24731, Name > DW_TAG_enumerator > Tag = 21, Count = 354166, Ops = 2833328, Name > DW_TAG_subroutine_type > Tag = 2, Count = 77999, Ops = 623992, Name > DW_TAG_class_type > Tag = 47, Count = 27122, Ops = 108488, Name > DW_TAG_template_type_parameter > Tag = 28, Count = 8491, Ops = 33964, Name > DW_TAG_inheritance > Tag = 66, Count = 10930, Ops = 43720, Name > DW_TAG_rvalue_reference_type > Tag = 16, Count = 54680, Ops = 218720, Name > DW_TAG_reference_type > Tag = 23, Count = 624, Ops = 4992, Name > DW_TAG_union_type > Tag = 4, Count = 5344, Ops = 42752, Name > DW_TAG_enumeration_type > Tag = 11, Count = 360390, Ops = 1081170, Name > DW_TAG_lexical_block > Tag = 258, Count = 1, Ops = 1, Name > DW_TAG_expression > Tag = 13, Count = 73880, Ops = 299110, Name = DW_TAG_member > Tag = 58, Count = 1387, Ops = 4161, Name > DW_TAG_imported_module > Tag = 1, Count = 2747, Ops = 21976, Name > DW_TAG_array_type > Tag = 46, Count = 1341021, Ops = 12069189, Name > DW_TAG_subprogram > Tag = 257, Count = 4373879, Ops = 20785065, Name > DW_TAG_arg_variable > Tag = 8, Count = 2246, Ops = 6738, Name > DW_TAG_imported_declaration > Tag = 53, Count = 57, Ops = 228, Name > DW_TAG_volatile_type > Tag = 15, Count = 55163, Ops = 220652, Name > DW_TAG_pointer_type > Tag = 41, Count = 3382, Ops = 6764, Name = DW_TAG_file_type > Tag = 22, Count = 158479, Ops = 633916, Name = DW_TAG_typedef > Tag = 48, Count = 486, Ops = 2430, Name > DW_TAG_template_value_parameter > Tag = 36, Count = 15, Ops = 45, Name = DW_TAG_base_type > Tag = 17, Count = 1164, Ops = 8148, Name > DW_TAG_compile_unit > Tag = 31, Count = 19, Ops = 95, Name > DW_TAG_ptr_to_member_type > Tag = 57, Count = 2034, Ops = 6102, Name = DW_TAG_namespace > Tag = 38, Count = 32133, Ops = 128532, Name > DW_TAG_const_type > Tag = 19, Count = 72995, Ops = 583960, Name > DW_TAG_structure_type > > (Note: the InlinedLineTables stat is included in LineTables stat.) > > You can determine the rough memory footprint of each type of node by > multiplying the "Count" by `sizeof(MDNode)` (x86-64: 56B) and the "Ops" > by `sizeof(MDNodeOperand)` (x86-64: 32B). > > Overall, there are 7.5M linetables with 30M operands, so by this method > their footprint is ~1.3GB. There are 7.6M descriptors with 42.4M > operands, so their footprint is ~1.7GB. > > I dumped another stat periodically to tell me the peak size of the > side-tables for line table entries, which are split into "Scopes" (for > non-inlined) and "Inlined" (these counts are disjoint, unlike the > previous stats): > > Scopes = 203166 [203166], Inlined = 3500000 [3500000] > > I assumed that both `DenseMap` and `std::vector` over-allocate by 50% > to estimate the current (and planned) costs for the side-tables. > > Another stat I dumped periodically was the breakdown between V(alues), > U(sers), C(onstants), M(etadata nodes), and (metadata) S(trings). > Here's a sample from nearby: > > V = 23967800 (40200000 - 16232200) > U = 5850877 ( 7365503 - 1514626) > C = 205491 ( 279134 - 73643) > M = 16837368 (31009291 - 14171923) > S = 693869 ( 693869 - 0) > > Lastly, I dumped a breakdown of the types of MDNodeOperands. This is > also a sample from nearby: > > MDOps = 77644750 (100%) > Const = 14947077 ( 19%) > Node = 41749475 ( 53%) > Str = 9553581 ( 12%) > Null = 10976693 ( 14%) > Other = 417924 ( 0%) > > While I didn't use this breakdown for my memory estimates, it was > interesting nevertheless. Note the following: > > - The number of constants is just under 15M. This dump came less than > a second before the dump above, where we have 7.5M line table > entries. Line table entries have 2 operands of `ConstantInt`. This > lines up nicely. > > Note: this checked `isa<Constant>(Op) && !isa<GlobalValue>(Op)`. > > - There are a lot of null operands. By making subclasses for the > various types of debug info IR, we can probably shed some of these > altogether. > > - There are few "Other" operands. These are likely all `GlobalValue` > references, and are the only operands that need to be referenced > using value handles. > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141016/d2b8face/attachment.html>
Duncan P. N. Exon Smith
2014-Oct-17 15:47 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
> On 2014 Oct 16, at 22:09, Sean Silva <chisophugis at gmail.com> wrote: > > Dig into this first!This isn't the right forum for digging into ld64.> In the OP you are talking about essentially a pure "optimization" (in the programmer-wisdom "beware of it" sense), to "save" 2GB of peak memory. But from your analysis it's not clear that this 2GB savings actually is reflected as peak memory usage savingIt's reflected in both links.> (since the ~30GB peak might be happening elsewhere in the LTO process). It is this ~30GB peak, and not the one you originally analyzed, which your customers presumably care about.This discussion is intentionally focused on llvm-lto.> To recap, the analysis you performed seems to support neither of the following conclusions: > - Peak memory usage during LTO would be improved by this planThe analysis is based on the nodes allocated at peak memory.> - Build time for LTO would be improved by this plan (from what you have posted, you didn't measure time at all)CPU profiles blame 25-35% of the CPU of the ld64 LTO link on callback-based metadata RAUW traffic, depending on the C++ program.> Of course, this is all tangential to the discussion of e.g. a more readable/writable .ll form for debug info, or debug info compatibility. However, it seems like you jumped into this from the point of view of it being an optimization, rather than a maintainability/compatibility thing.It's both.