Duncan P. N. Exon Smith
2014-Oct-13  22:02 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
In r219010, I merged integer and string fields into a single header
field.  By reducing the number of metadata operands used in debug info,
this saved 2.2GB on an `llvm-lto` bootstrap.  I've done some profiling
of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
I've concluded that they will be insufficient.
Instead, I'd like to implement a more aggressive plan, which as a
side-effect cleans up the much "loved" debug info IR assembly syntax.
At a high-level, the idea is to create distinct subclasses of `Value`
for each debug info concept, starting with line table entries and moving
on to the DIDescriptor hierarchy.  By leveraging the use-list
infrastructure for metadata operands -- i.e., only using value handles
for non-metadata operands -- we'll improve memory usage and increase
RAUW speed.
My rough plan follows.  I quote some numbers for memory savings below
based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
-save-temps option) that currently peaks at 15.3GB.
 1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
    must all be metadata.  The cost per operand is 1 pointer, vs. 4
    pointers in an `MDNode`.
 2. Create `MDLineTable` as the first subclass of `MDUser`.  Use normal
    fields (not `Value`s) for the line and column, and use `Use`
    operands for the metadata operands.
    On x86-64, this will save 104B / line table entry.  Linking
    `llvm-lto` uses ~7M line-table entries, so this on its own saves
    ~700MB.
    Sketch of class definition:
        class MDLineTable : public MDUser {
          unsigned Line;
          unsigned Column;
        public:
          static MDLineTable *get(unsigned Line, unsigned Column,
                                  MDNode *Scope);
          static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
          static MDLineTable *getBase(MDLineTable *Inlined);
          unsigned getLine() const { return Line; }
          unsigned getColumn() const { return Column; }
          bool isInlined() const { return getNumOperands() == 2; }
          MDNode *getScope() const { return getOperand(0); }
          MDNode *getInlinedAt() const { return getOperand(1); }
        };
    Proposed assembly syntax:
        ; Not inlined.
        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)
        ; Inlined.
        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
                                   inlinedAt: metadata !10)
        ; Column defaulted to 0.
        !7 = metadata !MDLineTable(line: 45, scope: metadata !9)
    (What colour should that bike shed be?)
 3. (Optional) Rewrite `DebugLoc` lookup tables.  My profiling shows
    that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
    table entries.  The cost of these is ~180B each, for another
    ~600MB.
    If we integrate a side-table of `MDLineTable`s into its uniquing,
    the overhead is only ~12B / line table entry, or ~80MB.  This saves
    520MB.
    This is somewhat perpendicular to redesigning the metadata format,
    but IMO it's worth doing as soon as it's possible.
 4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
    through an intermediate class `DebugMDNode` with an
    allocation-time-optional `CallbackVH` available for referencing
    non-metadata.  Change `DIDescriptor` to wrap a `DebugMDNode` instead
    of an `MDNode`.
    This saves another ~960MB, for a running total of ~2GB.
    Proposed assembly syntax:
        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
                                          fields: "0\00clang 3.6\00...",
                                          operands: { metadata !8, ... })
        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
                                          fields: "global_var\00...",
                                          operands: { metadata !8, ... },
                                          handle: i32* @global_var)
    This syntax pulls the tag out of the current header-string, calls
    the rest of the header "fields", and includes the metadata operands
    in "operands".
 5. Incrementally create subclasses of `DebugMDNode`, such as
    `MDCompileUnit` and `MDSubprogram`.  Sub-classed nodes replace the
    "fields" and "operands" catch-alls with explicit names for each
    operand.
    Proposed assembly syntax:
        !7 = metadata !MDSubprogram(line: 45, name: "foo", displayName: "foo",
                                    linkageName: "_Z3foov", file: metadata !8,
                                    function: i32 (i32)* @foo)
 6. Remove the dead code for `GenericDebugMDNode`.
 7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW
    traffic during bitcode serialization.  Now that metadata types are
    known, we can write debug info out in an order that makes it cheap
    to read back in.
    Note that using `MDUser` will make RAUW much cheaper, since we're
    using the use-list infrastructure for most of them.  If RAUW isn't
    showing up in a profile, I may skip this.
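
The savings quoted in steps 2 through 4 can be cross-checked with a little arithmetic.  What follows is only a sketch: the per-entry costs and entry counts are the ones quoted above, while the constant and function names are made up for illustration, and treating 1MB as 10^6 bytes is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Figures below are the ones quoted in the message; the names are
// illustrative only.  Assumption: 1MB = 10^6 bytes.
constexpr std::uint64_t kBytesPerMB = 1000000;

// Whole megabytes used by NumEntries objects of BytesPerEntry bytes each.
constexpr std::uint64_t totalMB(std::uint64_t BytesPerEntry,
                                std::uint64_t NumEntries) {
  return BytesPerEntry * NumEntries / kBytesPerMB;
}

// Step 2: 104B saved per entry across ~7M line table entries.
constexpr std::uint64_t kLineTableSavingsMB = totalMB(104, 7000000);

// Step 3: ~180B for each of 3.5M side-vector entries today, versus ~12B
// for each of 7M line table entries with an integrated side-table.
constexpr std::uint64_t kDebugLocCurrentMB = totalMB(180, 3500000);
constexpr std::uint64_t kDebugLocProposedMB = totalMB(12, 7000000);

// Running total from the rounded figures quoted in steps 2, 3 and 4.
constexpr std::uint64_t kQuotedTotalMB = 700 + 520 + 960;

// Fraction of a peak (in MB) that a saving (in MB) represents.
inline double fractionOfPeak(std::uint64_t SavedMB, std::uint64_t PeakMB) {
  assert(PeakMB != 0 && "peak must be non-zero");
  return static_cast<double>(SavedMB) / static_cast<double>(PeakMB);
}
```

The unrounded products are 728MB, 630MB and 84MB, consistent with the ~700MB, ~600MB and ~80MB quoted in the text.  Calling `fractionOfPeak(kQuotedTotalMB, 15300)` puts the quoted running total at roughly 14% of the 15.3GB peak.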
Does this direction seem reasonable?  Any major problems I've missed?
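
To make steps 2 and 3 more concrete, here is a toy model of a uniqued line table entry that stores the line and column as plain integers and keeps only the scope and inlined-at links as pointers.  It is a self-contained sketch and does not use LLVM: `ToyScope`, `ToyLineTable` and `ToyContext` are hypothetical names, the `get` signature differs from the proposed `MDLineTable::get`/`getInlined`, and the `std::map` is just a stand-in for whatever uniquing table the real implementation would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <tuple>

// Stand-in for a scope node; the proposal uses MDNode here.
struct ToyScope {};

// Toy counterpart of the proposed MDLineTable.
class ToyLineTable {
  unsigned Line;
  unsigned Column;
  const ToyScope *Scope;
  const ToyLineTable *InlinedAt; // null when the entry is not inlined

public:
  ToyLineTable(unsigned Line, unsigned Column, const ToyScope *Scope,
               const ToyLineTable *InlinedAt)
      : Line(Line), Column(Column), Scope(Scope), InlinedAt(InlinedAt) {}

  unsigned getLine() const { return Line; }
  unsigned getColumn() const { return Column; }
  bool isInlined() const { return InlinedAt != nullptr; }
  const ToyScope *getScope() const { return Scope; }
  const ToyLineTable *getInlinedAt() const {
    assert(isInlined() && "entry has no inlined-at location");
    return InlinedAt;
  }
};

// Owns every entry and hands out one object per distinct
// (line, column, scope, inlined-at) combination.
class ToyContext {
  using Key = std::tuple<unsigned, unsigned, std::uintptr_t, std::uintptr_t>;
  std::map<Key, std::unique_ptr<ToyLineTable>> Entries;

public:
  const ToyLineTable *get(unsigned Line, unsigned Column,
                          const ToyScope *Scope,
                          const ToyLineTable *InlinedAt = nullptr) {
    // Pointers are keyed by their integer value so ordering is well defined.
    Key K(Line, Column, reinterpret_cast<std::uintptr_t>(Scope),
          reinterpret_cast<std::uintptr_t>(InlinedAt));
    auto It = Entries.find(K);
    if (It == Entries.end())
      It = Entries
               .emplace(K, std::make_unique<ToyLineTable>(Line, Column, Scope,
                                                          InlinedAt))
               .first;
    return It->second.get();
  }

  std::size_t size() const { return Entries.size(); }
};
```

Requesting the same line, column and scope twice from one `ToyContext` returns the same pointer, while passing a non-null `InlinedAt` yields a distinct entry whose `isInlined()` is true.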
David Blaikie
2014-Oct-13  22:23 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
On Mon, Oct 13, 2014 at 3:02 PM, Duncan P. N. Exon Smith
<dexonsmith at apple.com> wrote:

> In r219010, I merged integer and string fields into a single header
> field.  By reducing the number of metadata operands used in debug info,
> this saved 2.2GB on an `llvm-lto` bootstrap.  I've done some profiling
> of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
> I've concluded that they will be insufficient.

Could you explain what your end-goal here looked like and what data you
used to evaluate its insufficiency?

Just to be clear, what I was picturing was that, starting with your
initial improvement, we'd string-ify more data in the records but
eventually we'd start stringifying across records (eg: rolling a
DW_TAG_structure_type's members into the structure type itself, one big
string). In the end we'd just pull out the non-metadata references (like
the llvm::Function* in the DW_TAG_subroutine_type metadata) into a table
kept separately from a handful of big strings of debug info (I say a
handful, as we'd keep the types separate so they could be easily
deduplicated).

> Instead, I'd like to implement a more aggressive plan, which as a
> side-effect cleans up the much "loved" debug info IR assembly syntax.
>
> At a high-level, the idea is to create distinct subclasses of `Value`
> for each debug info concept,

My concern with this is baking parts of our current debug info
representation into IR constructs seems rather heavyweight. If we need to
add first class IR constructs to cope with debug info I'd hope to find,
ideally, one, general purpose extension we can use for this (& possibly
for other things). But maybe the bar for adding first class IR constructs
is lower than I've imagined it to be.

> starting with line table entries and moving
> on to the DIDescriptor hierarchy.  By leveraging the use-list
> infrastructure for metadata operands -- i.e., only using value handles
> for non-metadata operands -- we'll improve memory usage and increase
> RAUW speed.
>
> My rough plan follows.  I quote some numbers for memory savings below
> based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
> on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
> -save-temps option) that currently peaks at 15.3GB.
>
>  1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
>     must all be metadata.  The cost per operand is 1 pointer, vs. 4
>     pointers in an `MDNode`.

Perhaps a generic MD-only-node might be a sufficiently generically
valuable IR construct.

A similar alternative: A schematized metadata node. Much like DWARF, being
able to say "this node is of some type T, defined elsewhere in the module
- string, int, string, string, etc... ". Heck, this could even be just a
generic improvement to llvm IR, maybe? (the textual representation might
not need to change at all - IR Generation would just do much like DWARF
generation in LLVM does - create abbreviation/type descriptions on the fly
and share them rather than having every metadata node include its own
self-description)

>  4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
>     through an intermediate class `DebugMDNode` with an
>     allocation-time-optional `CallbackVH` available for referencing
>     non-metadata.  Change `DIDescriptor` to wrap a `DebugMDNode` instead
>     of an `MDNode`.
>
>     This saves another ~960MB,

960 from what?

> for a running total of ~2GB.

~2GB is the total of what? (you mention a lot of numbers in this post, but
it's not always clear what they're relative to/out of/subtracted from)

>  5. Incrementally create subclasses of `DebugMDNode`, such as
>     `MDCompileUnit` and `MDSubprogram`.  Sub-classed nodes replace the
>     "fields" and "operands" catch-alls with explicit names for each
>     operand.

I wouldn't mind seeing how expensive it would be if these schema
descriptions were within the module itself - so we didn't have to bake
them into the IR spec, but could still share them between every usage
within a module.
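
One way to picture the "schematized metadata node" idea is sketched below: field kinds are stored once per schema in the module, and each node carries only a schema ID plus its values, much as DWARF abbreviations are shared between DIEs.  This is purely illustrative and not an LLVM API; `FieldKind`, `Schema`, `FieldValue` and `SchemaModule` are hypothetical names, and the linear search in `internSchema` is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class FieldKind { Int, String, NodeRef };

// A schema lists the field kinds of one node "type"; it is stored once.
using Schema = std::vector<FieldKind>;

// A value without any self-description; its kind comes from the schema.
struct FieldValue {
  long long Int;   // payload for Int fields, or target node ID for NodeRef
  std::string Str; // payload for String fields
};

inline FieldValue intField(long long V) { return FieldValue{V, std::string()}; }
inline FieldValue strField(std::string S) { return FieldValue{0, std::move(S)}; }

class SchemaModule {
  struct Node {
    std::size_t SchemaID;           // index into Schemas
    std::vector<FieldValue> Values; // one value per schema field
  };
  std::vector<Schema> Schemas;
  std::vector<Node> Nodes;

public:
  // Returns the ID of an identical existing schema, or registers a new one.
  std::size_t internSchema(const Schema &S) {
    for (std::size_t I = 0; I != Schemas.size(); ++I)
      if (Schemas[I] == S)
        return I;
    Schemas.push_back(S);
    return Schemas.size() - 1;
  }

  // Adds a node that follows an already-registered schema.
  std::size_t addNode(std::size_t SchemaID, std::vector<FieldValue> Values) {
    assert(SchemaID < Schemas.size() && "unknown schema");
    assert(Values.size() == Schemas[SchemaID].size() && "field count mismatch");
    Nodes.push_back(Node{SchemaID, std::move(Values)});
    return Nodes.size() - 1;
  }

  // Looks up a field's kind through the node's shared schema.
  FieldKind kindOf(std::size_t NodeID, std::size_t FieldIdx) const {
    assert(NodeID < Nodes.size() && "unknown node");
    const Schema &S = Schemas[Nodes[NodeID].SchemaID];
    assert(FieldIdx < S.size() && "field index out of range");
    return S[FieldIdx];
  }

  std::size_t numSchemas() const { return Schemas.size(); }
  std::size_t numNodes() const { return Nodes.size(); }
};
```

In this sketch, interning the same field layout twice returns the same ID, so any number of nodes with that layout share a single description.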
Reid Kleckner
2014-Oct-13  22:37 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
I think making debug info more of a first-class IR citizen is probably the
way to go. Right now debug info is completely unreadable and is downright
opposed to the design goals of the IR as I understand them.

Our backwards compatibility policy should give you the flexibility you
need to update the debug info representation as you go along:
http://llvm.org/docs/DeveloperPolicy.html#id18
David Blaikie
2014-Oct-13  22:47 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
On Mon, Oct 13, 2014 at 3:37 PM, Reid Kleckner <rnk at google.com> wrote:

> I think making debug info more of a first-class IR citizen is probably the
> way to go. Right now debug info is completely unreadable and is downright
> opposed to the design goals of the IR as I understand them.

I'm still not sure this would produce particularly more legible, let alone
writeable, debug info IR. It's possible, certainly, if the schema was
baked into IR reading and writing, that we could pretty print it with
annotated field names and allow writing the debug info with omitted fields
(because the parser would know that this was, say, a subprogram record,
and be able to reorder fields to the required schema or add default values
for omitted fields), but I'm not sure we'd get that far nor whether it
would really tip debug info to the point of writeability - it's still
necessarily a format that describes code, which tends towards being more
ungainly than the code itself. ("this thing is on line 42" rather than
"thing" written on line 42)

I'd have to see examples & promises of where this would go/what value it
would add, but I'd still be fairly concerned about the ongoing costs.

> Our backwards compatibility policy should give you the flexibility you
> need to update the debug info representation as you go along:
> http://llvm.org/docs/DeveloperPolicy.html#id18

It's a rather heavy burden to carry. Currently we have a much lighter cost
to changing the debug info schema (rev the version number - any debug info
with an older version number is dropped on sight).
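
The "rev the version number" policy mentioned above can be pictured with a toy model.  This is a sketch of the idea only and does not reproduce LLVM's actual code: `VersionedModule`, `kCurrentDebugInfoVersion` and `dropStaleDebugInfo` are hypothetical names, and the version value is arbitrary.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical constant: the debug info schema version this toy reader
// understands.
constexpr unsigned kCurrentDebugInfoVersion = 2;

struct VersionedModule {
  unsigned DebugInfoVersion;           // version stamped by the producer
  std::vector<std::string> DebugNodes; // stand-in for debug info metadata
};

// Drops all debug info if it was written with an older schema version.
// Returns true when anything was dropped.
inline bool dropStaleDebugInfo(VersionedModule &M) {
  if (M.DebugInfoVersion >= kCurrentDebugInfoVersion)
    return false;
  M.DebugNodes.clear();
  assert(M.DebugNodes.empty() && "stale debug info must be gone");
  return true;
}
```

The point of the sketch is the cost model: the reader never has to understand an old schema, so changing the schema only requires bumping one number.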
Duncan P. N. Exon Smith
2014-Oct-13  23:30 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
> On Oct 13, 2014, at 3:23 PM, David Blaikie <dblaikie at gmail.com> wrote: > > > > On Mon, Oct 13, 2014 at 3:02 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote: >> In r219010, I merged integer and string fields into a single header >> field. By reducing the number of metadata operands used in debug info, >> this saved 2.2GB on an `llvm-lto` bootstrap. I've done some profiling >> of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and >> I've concluded that they will be insufficient. >> > Could you explain what your end-goal here looked like and what data you used to evaluate its insufficiency?In the links of C++ programs I've looked at, most `Value`s are line tables and local variables. E.g., for the `llvm-lto.lto.bc` case I've used for memory numbers: - 23967800 Value - 16837368 MDNode - 7611669 DIDescriptor - 4373879 DW_TAG_arg_variable - 1341021 DW_TAG_subprogram - 554992 DW_TAG_auto_variable - 360390 DW_TAG_lexical_block - 354166 DW_TAG_subroutine_type - 7500000 line table entries - 5850877 User - 693869 MDString IIUC, line tables and local variables need to be referenced directly from the rest of the IR, so they can't be sunk into other nodes. Relevant to your question, I didn't a way to sufficiently decrease the numbers of these (or the number of their operands).> Just to be clear, what I was picturing was that, starting with your initial improvement, we'd string-ify more data in the records but eventually we'd start stringifying across records (eg: rolling a DW_TAG_structure_type's members into the structure type itself, one big string). In the end we'd just pull out the non-metadata references (like the llvm::Function* in the DW_TAG_subroutine_type metadata) into a table kept separately from a handful of big strings of debug info (I say a handful, as we'd keep the types separate so they could be easily deduplicated).I was thinking along the same lines. Unfortunately, there aren't enough types left for that to make a big impact. 
Unless you envisioned a completely different way of dealing with `@llvm.dbg.value` and `!dbg` references?>> Instead, I'd like to implement a more aggressive plan, which as a >> side-effect cleans up the much "loved" debug info IR assembly syntax. >> >> At a high-level, the idea is to create distinct subclasses of `Value` >> for each debug info concept, > > My concern with this is baking parts of our current debug info representation into IR constructs seems rather heavyweight. If we need to add first class IR constructs to cope with debug info I'd hope to find, ideally, one, general purpose extension we can use for this (& possibly for other things). But maybe the bar for adding first class IR constructs is lower than I've imagined it to be.Since 75% of all `Value`s are debug info, representing them well seems worthwhile to me.>> starting with line table entries and moving >> on to the DIDescriptor hierarchy. By leveraging the use-list >> infrastructure for metadata operands -- i.e., only using value handles >> for non-metadata operands -- we'll improve memory usage and increase >> RAUW speed. >> >> My rough plan follows.(Note the following sentence, which I think you missed.)>> I quote some numbers for memory savings below >> based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto` >> on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's >> -save-temps option) that currently peaks at 15.3GB. >> >> 1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s >> must all be metadata. The cost per operand is 1 pointer, vs. 4 >> pointers in an `MDNode`. > > Perhaps a generic MD-only-node might be a sufficiently generically valuable IR construct. > > A similar alternative: A schematized metadata node. Much like DWARF, being able to say "this node is of some type T, defined elsewhere in the module - string, int, string, string, etc... ". Heck, this could even be just a generic improvement to llvm IR, maybe? 
(the textual representation might not need to change at all - IR Generation would just do much like DWARF generation in LLVM does - create abbreviation/type descriptions on the fly and share them rather than having every metadata node include its own self-description)
>

"Being generic" seems like a defect to me, not a feature.  If you need
to add support for every IR construct to the backend to emit DIEs, etc.,
then what's the benefit in being able to express arbitrary other things?

>>  2. Create `MDLineTable` as the first subclass of `MDUser`.  Use normal
>>     fields (not `Value`s) for the line and column, and use `Use`
>>     operands for the metadata operands.
>>
>>     On x86-64, this will save 104B / line table entry.  Linking
>>     `llvm-lto` uses ~7M line-table entries, so this on its own saves
>>     ~700MB.
>>
>>     Sketch of class definition:
>>
>>         class MDLineTable : public MDUser {
>>           unsigned Line;
>>           unsigned Column;
>>         public:
>>           static MDLineTable *get(unsigned Line, unsigned Column,
>>                                   MDNode *Scope);
>>           static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
>>           static MDLineTable *getBase(MDLineTable *Inlined);
>>
>>           unsigned getLine() const { return Line; }
>>           unsigned getColumn() const { return Column; }
>>           bool isInlined() const { return getNumOperands() == 2; }
>>           MDNode *getScope() const { return getOperand(0); }
>>           MDNode *getInlinedAt() const { return getOperand(1); }
>>         };
>>
>>     Proposed assembly syntax:
>>
>>         ; Not inlined.
>>         !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)
>>
>>         ; Inlined.
>>         !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
>>                                    inlinedAt: metadata !10)
>>
>>         ; Column defaulted to 0.
>>         !7 = metadata !MDLineTable(line: 45, scope: metadata !9)
>>
>>     (What colour should that bike shed be?)
>>
>>  3. (Optional) Rewrite `DebugLoc` lookup tables.  My profiling shows
>>     that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
>>     table entries.  The cost of these is ~180B each, for another
>>     ~600MB.
>>
>>     If we integrate a side-table of `MDLineTable`s into its uniquing,
>>     the overhead is only ~12B / line table entry, or ~80MB.  This saves
>>     520MB.
>>
>>     This is somewhat perpendicular to redesigning the metadata format,
>>     but IMO it's worth doing as soon as it's possible.
>>
>>  4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
>>     through an intermediate class `DebugMDNode` with an
>>     allocation-time-optional `CallbackVH` available for referencing
>>     non-metadata.  Change `DIDescriptor` to wrap a `DebugMDNode` instead
>>     of an `MDNode`.
>>
>>     This saves another ~960MB,
>
> 960 from what?

This number references the sentence noted above.

>>     for a running total of ~2GB.
>
> ~2GB is the total of what? (you mention a lot of numbers in this post, but it's not always clear what they're relative to/out of/subtracted from)

This number references the sentence noted above.

>>     Proposed assembly syntax:
>>
>>         !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
>>                                           fields: "0\00clang 3.6\00...",
>>                                           operands: { metadata !8, ... })
>>
>>         !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
>>                                           fields: "global_var\00...",
>>                                           operands: { metadata !8, ... },
>>                                           handle: i32* @global_var)
>>
>>     This syntax pulls the tag out of the current header-string, calls
>>     the rest of the header "fields", and includes the metadata operands
>>     in "operands".
>>
>>  5. Incrementally create subclasses of `DebugMDNode`, such as
>>     `MDCompileUnit` and `MDSubprogram`.  Sub-classed nodes replace the
>>     "fields" and "operands" catch-alls with explicit names for each
>>     operand.
>
> I wouldn't mind seeing how expensive it would be if these schema descriptions were within the module itself - so we didn't have to bake them into the IR spec, but could still share them between every usage within a module.

It's already baked into the IR spec, since the backend needs to
understand debug info to emit it.  We might as well understand what
exactly we're representing by formalizing it.

>>     Proposed assembly syntax:
>>
>>         !7 = metadata !MDSubprogram(line: 45, name: "foo", displayName: "foo",
>>                                     linkageName: "_Z3foov", file: metadata !8,
>>                                     function: i32 (i32)* @foo)
>>
>>  6. Remove the dead code for `GenericDebugMDNode`.
>>
>>  7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW
>>     traffic during bitcode serialization.  Now that metadata types are
>>     known, we can write debug info out in an order that makes it cheap
>>     to read back in.
>>
>>     Note that using `MDUser` will make RAUW much cheaper, since we're
>>     using the use-list infrastructure for most of them.  If RAUW isn't
>>     showing up in a profile, I may skip this.
>>
>> Does this direction seem reasonable?  Any major problems I've missed?
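The "~2GB" running total questioned above appears to be the sum of the savings quoted for steps 2 through 4. A minimal sketch of that tally follows, assuming decimal units and the rounded counts from the RFC. The constant and function names are illustrative only and are not LLVM APIs, and the step 4 figure is taken as given rather than derived.

```cpp
#include <cstdint>

// Rounded figures quoted in the RFC for the `llvm-lto` bootstrap (x86-64).
constexpr std::uint64_t LineTableEntries = 7000000;   // ~7M line-table entries
constexpr std::uint64_t SideVectorEntries = 3500000;  // ~3.5M `DebugLoc` side-vector entries

// Step 2: ~104B saved per line-table entry.
constexpr std::uint64_t step2Savings() { return 104 * LineTableEntries; }

// Step 3: replace ~180B side-vector entries with ~12B of overhead per
// line-table entry.
constexpr std::uint64_t step3Savings() {
  return 180 * SideVectorEntries - 12 * LineTableEntries;
}

// Step 4: the RFC quotes ~960MB without showing the derivation.
constexpr std::uint64_t step4Savings() { return 960000000; }

// Running total after step 4, in bytes.
constexpr std::uint64_t runningTotal() {
  return step2Savings() + step3Savings() + step4Savings();
}
```

With these inputs the unrounded total comes to about 2.2GB, which the RFC rounds to "~2GB". The RFC's own per-step figures (~700MB, 520MB, ~960MB) are rounded before being added, so they differ slightly from the values computed here.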
Sean Silva
2014-Oct-14  01:59 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
For those interested, I've attached some pie charts based on Duncan's data
in one of the other posts; successive slides break down the usage
increasingly finely. To my understanding, they represent the number of
`Value`s (and subclasses) allocated.

On Mon, Oct 13, 2014 at 3:02 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:

> In r219010, I merged integer and string fields into a single header
> field.  By reducing the number of metadata operands used in debug info,
> this saved 2.2GB on an `llvm-lto` bootstrap.  I've done some profiling
> of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
> I've concluded that they will be insufficient.
>
> Instead, I'd like to implement a more aggressive plan, which as a
> side-effect cleans up the much "loved" debug info IR assembly syntax.
>
> At a high-level, the idea is to create distinct subclasses of `Value`
> for each debug info concept, starting with line table entries and moving
> on to the DIDescriptor hierarchy.  By leveraging the use-list
> infrastructure for metadata operands -- i.e., only using value handles
> for non-metadata operands -- we'll improve memory usage and increase
> RAUW speed.
>
> My rough plan follows.  I quote some numbers for memory savings below
> based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
> on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
> -save-temps option) that currently peaks at 15.3GB.
>

Stupid question, but when I was working on LTO last summer the primary
culprit for excessive memory use was that we weren't being smart when
linking the IR together (Espindola would know more details). Do we still
have that problem? For starters, how does the memory usage of just
llvm-link compare to the memory usage of the actual LTO run? If the issue
I was seeing last summer is still there, you should see that the
invocation of llvm-link is actually the most memory-intensive part of the
LTO step, by far.

Also, you seem to really like saying "peak" here. Is there a definite
peak? When does it occur?

>
>  1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
>     must all be metadata.  The cost per operand is 1 pointer, vs. 4
>     pointers in an `MDNode`.
>
>  2. Create `MDLineTable` as the first subclass of `MDUser`.  Use normal
>     fields (not `Value`s) for the line and column, and use `Use`
>     operands for the metadata operands.
>
>     On x86-64, this will save 104B / line table entry.  Linking
>     `llvm-lto` uses ~7M line-table entries, so this on its own saves
>     ~700MB.
>
>     Sketch of class definition:
>
>         class MDLineTable : public MDUser {
>           unsigned Line;
>           unsigned Column;
>         public:
>           static MDLineTable *get(unsigned Line, unsigned Column,
>                                   MDNode *Scope);
>           static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
>           static MDLineTable *getBase(MDLineTable *Inlined);
>
>           unsigned getLine() const { return Line; }
>           unsigned getColumn() const { return Column; }
>           bool isInlined() const { return getNumOperands() == 2; }
>           MDNode *getScope() const { return getOperand(0); }
>           MDNode *getInlinedAt() const { return getOperand(1); }
>         };
>
>     Proposed assembly syntax:
>
>         ; Not inlined.
>         !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)
>
>         ; Inlined.
>         !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
>                                    inlinedAt: metadata !10)
>
>         ; Column defaulted to 0.
>         !7 = metadata !MDLineTable(line: 45, scope: metadata !9)
>
>     (What colour should that bike shed be?)
>
>  3. (Optional) Rewrite `DebugLoc` lookup tables.  My profiling shows
>     that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
>     table entries.  The cost of these is ~180B each, for another
>     ~600MB.
>
>     If we integrate a side-table of `MDLineTable`s into its uniquing,
>     the overhead is only ~12B / line table entry, or ~80MB.  This saves
>     520MB.
>
>     This is somewhat perpendicular to redesigning the metadata format,
>     but IMO it's worth doing as soon as it's possible.
>
>  4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
>     through an intermediate class `DebugMDNode` with an
>     allocation-time-optional `CallbackVH` available for referencing
>     non-metadata.  Change `DIDescriptor` to wrap a `DebugMDNode` instead
>     of an `MDNode`.
>
>     This saves another ~960MB, for a running total of ~2GB.
>

2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we have a
single pie slice near 40% of the # of `Value`s allocated and another at
21%. Especially this being "step 4".

As a rough back of the envelope calculation, dividing 15.3GB by ~24
million Values gives about 600 bytes per Value. That seems sort of
excessive (but is it realistic?). All of the data types that you are
proposing to shrink fall far short of this "average size", meaning that if
you are trying to reduce memory usage, you might be looking in the wrong
place. Something smells fishy. At the very least, this would indicate that
the real memory usage is elsewhere.

A pie chart breaking down the total memory usage seems essential to have
here.

>
>     Proposed assembly syntax:
>
>         !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
>                                           fields: "0\00clang 3.6\00...",
>                                           operands: { metadata !8, ... })
>
>         !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
>                                           fields: "global_var\00...",
>                                           operands: { metadata !8, ... },
>                                           handle: i32* @global_var)
>
>     This syntax pulls the tag out of the current header-string, calls
>     the rest of the header "fields", and includes the metadata operands
>     in "operands".
>
>  5. Incrementally create subclasses of `DebugMDNode`, such as
>     `MDCompileUnit` and `MDSubprogram`.  Sub-classed nodes replace the
>     "fields" and "operands" catch-alls with explicit names for each
>     operand.
>
>     Proposed assembly syntax:
>
>         !7 = metadata !MDSubprogram(line: 45, name: "foo", displayName: "foo",
>                                     linkageName: "_Z3foov", file: metadata !8,
>                                     function: i32 (i32)* @foo)
>
>  6. Remove the dead code for `GenericDebugMDNode`.
>
>  7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW
>     traffic during bitcode serialization.  Now that metadata types are
>     known, we can write debug info out in an order that makes it cheap
>     to read back in.
>
>     Note that using `MDUser` will make RAUW much cheaper, since we're
>     using the use-list infrastructure for most of them.  If RAUW isn't
>     showing up in a profile, I may skip this.
>
> Does this direction seem reasonable?  Any major problems I've missed?
>

You need more data. Right now you have essentially one data point, and
it's not even clear what you measured really. If your goal is saving
memory, I would expect at least a pie chart that breaks down LLVM's memory
usage (not just # of allocations of different sorts; an approximation is
fine, as long as you explain how you arrived at it and in what sense it
approximates the true number).

Do the numbers change significantly for different projects? (e.g. Chromium
or Firefox or a kernel or a large app you have handy to compile with
LTO?). If you have specific data you want (and a suggestion for how to
gather it), I can get you numbers for one of our internal games as well.

Once you have some more data, then as a first step, I would like to see an
analysis of how much we can "ideally" expect to gain (back of the envelope
calculations == win).

-- Sean Silva

>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: DebugInfoSize.pdf
Type: application/pdf
Size: 108040 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141013/b1da4b87/attachment.pdf>
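The back-of-the-envelope figures in this message can be checked in a few lines. A minimal sketch follows, assuming decimal gigabytes and the rounded inputs quoted in the thread (15.3GB peak, ~24 million `Value`s, ~2GB of projected savings). The names are illustrative and do not come from LLVM.

```cpp
#include <cstdint>

// Rounded inputs quoted in the thread.
constexpr std::uint64_t PeakBytes = 15300000000ULL;      // 15.3GB peak of `llvm-lto`
constexpr std::uint64_t LiveValues = 24000000ULL;        // ~24 million Values
constexpr std::uint64_t ProjectedSavings = 2000000000ULL; // ~2GB after step 4

// Average bytes of peak memory per live Value (integer division).
constexpr std::uint64_t averageBytesPerValue(std::uint64_t Peak,
                                             std::uint64_t Values) {
  return Peak / Values;
}

// Projected savings as a whole-number percentage of the peak.
constexpr std::uint64_t savingsPercent(std::uint64_t Savings,
                                       std::uint64_t Peak) {
  return Savings * 100 / Peak;
}
```

Called as `averageBytesPerValue(PeakBytes, LiveValues)` and `savingsPercent(ProjectedSavings, PeakBytes)`, these give roughly 637 bytes per `Value` and 13%, in line with the "about 600 bytes" and "~13%" quoted above. The average divides all peak memory by the `Value` count, so it says nothing about how much of that memory the `Value`s themselves occupy.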
Eric Christopher
2014-Oct-14  02:01 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
On Mon, Oct 13, 2014 at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote:
> Stupid question, but when I was working on LTO last Summer the primary
> culprit for excessive memory use was due to us not being smart when linking
> the IR together (Espindola would know more details). Do we still have that
> problem? For starters, how does the memory usage of just llvm-link compare
> to the memory usage of the actual LTO run? If the issue I was seeing last
> Summer is still there, you should see that the invocation of llvm-link is
> actually the most memory-intensive part of the LTO step, by far.
>

This is vague. Could you be more specific on where you saw all of the memory?

-eric
Duncan P. N. Exon Smith
2014-Oct-14  18:40 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
> On Oct 13, 2014, at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
> Stupid question, but when I was working on LTO last Summer the primary culprit for excessive memory use was due to us not being smart when linking the IR together (Espindola would know more details). Do we still have that problem? For starters, how does the memory usage of just llvm-link compare to the memory usage of the actual LTO run? If the issue I was seeing last Summer is still there, you should see that the invocation of llvm-link is actually the most memory-intensive part of the LTO step, by far.

To be clear, I'm running the command-line:

    $ llvm-lto -exported-symbol _main llvm-lto.lto.bc

Since this is a pre-linked bitcode file, we shouldn't be wasting much
memory from the linking stage.

Running ld64 directly gives a peak memory footprint of ~30GB for the
full link, so there's something else going on there that I'll be
digging into later.

> 2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we have a single pie slice near 40% of the # of Value's allocated and another at 21%. Especially this being "step 4".

15.3GB is the peak memory of `llvm-lto`.  This comes late in the
process, after DIEs have been created.  I haven't looked in detail past
debug info metadata, but here's a sketch of what I imagine is in memory
at this point.

  - The IR, including uniquing side-tables.

  - Optimization and backend passes.

  - Parts of SelectionDAG that haven't been freed.

  - `MachineFunction`s and everything inside them.

  - Whatever state the `AsmPrinter`, etc., need.

I expect to look at a couple of other debug-info-related memory usage
areas once I've shrunk the metadata:

  - What's the total footprint of DIEs?  This run has 4M of them, whose
    allocated footprint is ~1GB.  I'm hoping that a deeper look will
    reveal an even larger attack surface.

  - How much do debug info intrinsics cost?  They show up in at least
    three forms -- IR-level, SDNodes, and MachineInstrs -- and there can
    be a lot of them.  How many?  What's their footprint?

For now, I'm focusing on the problem I've already identified.

> You need more data. Right now you have essentially one data point,

I looked at a number of internal C and C++ programs with -flto -g, and
dug deeply into llvm-lto.lto.bc because it's small enough that it's easy
to analyze (and its runtime profile was representative of the other C++
programs I was looking at).

I didn't look deeply at a broad spectrum, but memory usage and runtime
for building clang with -flto -g is something we care a fair bit about.

> and it's not even clear what you measured really. If your goal is saving memory, I would expect at least a pie chart that breaks down LLVM's memory usage (not just # of allocations of different sorts; an approximation is fine, as long as you explain how you arrived at it and in what sense it approximates the true number).

I'm not sure there's value in diving deeply into everything at once.
I've identified one of the bottlenecks, so I'd like to improve it before
digging into the others.

Here's some visibility into where my numbers come from.

I got the 15.3GB from a profile of memory usage vs. time.  Peak usage
comes late in the process, around when DIEs are being dealt with.

Metadata node counts stabilize much earlier in the process.  The rest of
the numbers are based on counting `MDNode`s and their respective
`MDNodeOperand`s, and multiplying by the cost of their operands.  Here's
a dump from around the peak metadata node count:

    LineTables = 7500000[30000000], InlinedLineTables = 6756182, Directives = 7611669[42389128], Arrays = 570609[577447], Others = 1176556[5133065]
    Tag = 256, Count = 554992, Ops = 2531428, Name = DW_TAG_auto_variable
    Tag = 16647, Count = 988, Ops = 4940, Name = DW_TAG_GNU_template_parameter_pack
    Tag = 52, Count = 9933, Ops = 59598, Name = DW_TAG_variable
    Tag = 33, Count = 190, Ops = 190, Name = DW_TAG_subrange_type
    Tag = 59, Count = 1, Ops = 3, Name = DW_TAG_unspecified_type
    Tag = 40, Count = 24731, Ops = 24731, Name = DW_TAG_enumerator
    Tag = 21, Count = 354166, Ops = 2833328, Name = DW_TAG_subroutine_type
    Tag = 2, Count = 77999, Ops = 623992, Name = DW_TAG_class_type
    Tag = 47, Count = 27122, Ops = 108488, Name = DW_TAG_template_type_parameter
    Tag = 28, Count = 8491, Ops = 33964, Name = DW_TAG_inheritance
    Tag = 66, Count = 10930, Ops = 43720, Name = DW_TAG_rvalue_reference_type
    Tag = 16, Count = 54680, Ops = 218720, Name = DW_TAG_reference_type
    Tag = 23, Count = 624, Ops = 4992, Name = DW_TAG_union_type
    Tag = 4, Count = 5344, Ops = 42752, Name = DW_TAG_enumeration_type
    Tag = 11, Count = 360390, Ops = 1081170, Name = DW_TAG_lexical_block
    Tag = 258, Count = 1, Ops = 1, Name = DW_TAG_expression
    Tag = 13, Count = 73880, Ops = 299110, Name = DW_TAG_member
    Tag = 58, Count = 1387, Ops = 4161, Name = DW_TAG_imported_module
    Tag = 1, Count = 2747, Ops = 21976, Name = DW_TAG_array_type
    Tag = 46, Count = 1341021, Ops = 12069189, Name = DW_TAG_subprogram
    Tag = 257, Count = 4373879, Ops = 20785065, Name = DW_TAG_arg_variable
    Tag = 8, Count = 2246, Ops = 6738, Name = DW_TAG_imported_declaration
    Tag = 53, Count = 57, Ops = 228, Name = DW_TAG_volatile_type
    Tag = 15, Count = 55163, Ops = 220652, Name = DW_TAG_pointer_type
    Tag = 41, Count = 3382, Ops = 6764, Name = DW_TAG_file_type
    Tag = 22, Count = 158479, Ops = 633916, Name = DW_TAG_typedef
    Tag = 48, Count = 486, Ops = 2430, Name = DW_TAG_template_value_parameter
    Tag = 36, Count = 15, Ops = 45, Name = DW_TAG_base_type
    Tag = 17, Count = 1164, Ops = 8148, Name = DW_TAG_compile_unit
    Tag = 31, Count = 19, Ops = 95, Name = DW_TAG_ptr_to_member_type
    Tag = 57, Count = 2034, Ops = 6102, Name = DW_TAG_namespace
    Tag = 38, Count = 32133, Ops = 128532, Name = DW_TAG_const_type
    Tag = 19, Count = 72995, Ops = 583960, Name = DW_TAG_structure_type

(Note: the InlinedLineTables stat is included in LineTables stat.)

You can determine the rough memory footprint of each type of node by
multiplying the "Count" by `sizeof(MDNode)` (x86-64: 56B) and the "Ops"
by `sizeof(MDNodeOperand)` (x86-64: 32B).

Overall, there are 7.5M line tables with 30M operands, so by this method
their footprint is ~1.3GB.  There are 7.6M descriptors with 42.4M
operands, so their footprint is ~1.7GB.

I dumped another stat periodically to tell me the peak size of the
side-tables for line table entries, which are split into "Scopes" (for
non-inlined) and "Inlined" (these counts are disjoint, unlike the
previous stats):

    Scopes = 203166 [203166], Inlined = 3500000 [3500000]

I assumed that both `DenseMap` and `std::vector` over-allocate by 50% to
estimate the current (and planned) costs for the side-tables.

Another stat I dumped periodically was the breakdown between V(alues),
U(sers), C(onstants), M(etadata nodes), and (metadata) S(trings).
Here's a sample from nearby:

    V = 23967800 (40200000 - 16232200)
    U =  5850877 ( 7365503 -  1514626)
    C =   205491 (  279134 -    73643)
    M = 16837368 (31009291 - 14171923)
    S =   693869 (  693869 -        0)

Lastly, I dumped a breakdown of the types of MDNodeOperands.  This is
also a sample from nearby:

    MDOps = 77644750 (100%)
    Const = 14947077 ( 19%)
    Node  = 41749475 ( 53%)
    Str   =  9553581 ( 12%)
    Null  = 10976693 ( 14%)
    Other =   417924 (  0%)

While I didn't use this breakdown for my memory estimates, it was
interesting nevertheless.  Note the following:

  - The number of constants is just under 15M.  This dump came less than
    a second before the dump above, where we have 7.5M line table
    entries.  Line table entries have 2 operands of `ConstantInt`.  This
    lines up nicely.

    Note: this checked `isa<Constant>(Op) && !isa<GlobalValue>(Op)`.

  - There are a lot of null operands.  By making subclasses for the
    various types of debug info IR, we can probably shed some of these
    altogether.

  - There are few "Other" operands.  These are likely all `GlobalValue`
    references, and are the only operands that need to be referenced
    using value handles.
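The footprint estimate described in this message (node count times `sizeof(MDNode)` plus operand count times `sizeof(MDNodeOperand)`) can be written down directly. A minimal sketch follows, assuming the x86-64 sizes quoted above (56B and 32B). The constant and function names are illustrative, not LLVM APIs.

```cpp
#include <cstdint>

// Sizes quoted in the message for x86-64.
constexpr std::uint64_t MDNodeBytes = 56;        // sizeof(MDNode)
constexpr std::uint64_t MDNodeOperandBytes = 32; // sizeof(MDNodeOperand)

// Rough footprint in bytes of one category of metadata node, given the
// "Count" and "Ops" columns from the dump.
constexpr std::uint64_t footprintBytes(std::uint64_t Count,
                                       std::uint64_t Ops) {
  return Count * MDNodeBytes + Ops * MDNodeOperandBytes;
}
```

For example, `footprintBytes(7500000, 30000000)` gives 1,380,000,000 bytes for the line tables, and `footprintBytes(7611669, 42389128)` gives 1,782,705,560 bytes for the descriptors. Both are roughly in line with the ~1.3GB and ~1.7GB quoted above. The same function can be applied to any single row of the dump, such as `DW_TAG_subprogram`.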
Alex Rosenberg
2014-Oct-16  03:53 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
As all of these transforms are 1-to-1, can we still support the older metadata and convert it on the fly?

Alex
Eric Christopher
2014-Oct-16  06:30 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
On Wed, Oct 15, 2014 at 8:53 PM, Alex Rosenberg <alexr at leftfield.org> wrote:> As all of these transforms are 1-to-1, can we still support the older metadata and convert it on the fly? >I'd prefer not to keep all of that code around to interpret both versions without a very good reason. -eric> Alex > >> On Oct 13, 2014, at 3:02 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote: >> >> In r219010, I merged integer and string fields into a single header >> field. By reducing the number of metadata operands used in debug info, >> this saved 2.2GB on an `llvm-lto` bootstrap. I've done some profiling >> of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and >> I've concluded that they will be insufficient. >> >> Instead, I'd like to implement a more aggressive plan, which as a >> side-effect cleans up the much "loved" debug info IR assembly syntax. >> >> At a high-level, the idea is to create distinct subclasses of `Value` >> for each debug info concept, starting with line table entries and moving >> on to the DIDescriptor hierarchy. By leveraging the use-list >> infrastructure for metadata operands -- i.e., only using value handles >> for non-metadata operands -- we'll improve memory usage and increase >> RAUW speed. >> >> My rough plan follows. I quote some numbers for memory savings below >> based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto` >> on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's >> -save-temps option) that currently peaks at 15.3GB. >> >> 1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s >> must all be metadata. The cost per operand is 1 pointer, vs. 4 >> pointers in an `MDNode`. >> >> 2. Create `MDLineTable` as the first subclass of `MDUser`. Use normal >> fields (not `Value`s) for the line and column, and use `Use` >> operands for the metadata operands. >> >> On x86-64, this will save 104B / line table entry. 
>>     Linking `llvm-lto` uses ~7M line-table entries, so this on its own
>>     saves ~700MB.
>>
>>     Sketch of class definition:
>>
>>         class MDLineTable : public MDUser {
>>           unsigned Line;
>>           unsigned Column;
>>         public:
>>           static MDLineTable *get(unsigned Line, unsigned Column,
>>                                   MDNode *Scope);
>>           static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
>>           static MDLineTable *getBase(MDLineTable *Inlined);
>>
>>           unsigned getLine() const { return Line; }
>>           unsigned getColumn() const { return Column; }
>>           bool isInlined() const { return getNumOperands() == 2; }
>>           MDNode *getScope() const { return getOperand(0); }
>>           MDNode *getInlinedAt() const { return getOperand(1); }
>>         };
>>
>>     Proposed assembly syntax:
>>
>>         ; Not inlined.
>>         !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)
>>
>>         ; Inlined.
>>         !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
>>                                    inlinedAt: metadata !10)
>>
>>         ; Column defaulted to 0.
>>         !7 = metadata !MDLineTable(line: 45, scope: metadata !9)
>>
>>     (What colour should that bike shed be?)
>>
>>  3. (Optional) Rewrite `DebugLoc` lookup tables.  My profiling shows
>>     that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
>>     table entries.  The cost of these is ~180B each, for another
>>     ~600MB.
>>
>>     If we integrate a side-table of `MDLineTable`s into its uniquing,
>>     the overhead is only ~12B / line table entry, or ~80MB.  This saves
>>     520MB.
>>
>>     This is somewhat perpendicular to redesigning the metadata format,
>>     but IMO it's worth doing as soon as it's possible.
>>
>>  4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
>>     through an intermediate class `DebugMDNode` with an
>>     allocation-time-optional `CallbackVH` available for referencing
>>     non-metadata.  Change `DIDescriptor` to wrap a `DebugMDNode` instead
>>     of an `MDNode`.
>>
>>     This saves another ~960MB, for a running total of ~2GB.
>>
>>     Proposed assembly syntax:
>>
>>         !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
>>                                           fields: "0\00clang 3.6\00...",
>>                                           operands: { metadata !8, ... })
>>
>>         !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
>>                                           fields: "global_var\00...",
>>                                           operands: { metadata !8, ... },
>>                                           handle: i32* @global_var)
>>
>>     This syntax pulls the tag out of the current header-string, calls
>>     the rest of the header "fields", and includes the metadata operands
>>     in "operands".
>>
>>  5. Incrementally create subclasses of `DebugMDNode`, such as
>>     `MDCompileUnit` and `MDSubprogram`.  Sub-classed nodes replace the
>>     "fields" and "operands" catch-alls with explicit names for each
>>     operand.
>>
>>     Proposed assembly syntax:
>>
>>         !7 = metadata !MDSubprogram(line: 45, name: "foo", displayName: "foo",
>>                                     linkageName: "_Z3foov", file: metadata !8,
>>                                     function: i32 (i32)* @foo)
>>
>>  6. Remove the dead code for `GenericDebugMDNode`.
>>
>>  7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW
>>     traffic during bitcode serialization.  Now that metadata types are
>>     known, we can write debug info out in an order that makes it cheap
>>     to read back in.
>>
>>     Note that using `MDUser` will make RAUW much cheaper, since we're
>>     using the use-list infrastructure for most of them.  If RAUW isn't
>>     showing up in a profile, I may skip this.
>>
>> Does this direction seem reasonable?  Any major problems I've missed?
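
As a rough cross-check of the memory estimates quoted above, here is a minimal sketch that recomputes them from the per-entry figures in the proposal.  The helper names (`toMB`, `savingsMB`, and so on) are made up for illustration and are not part of LLVM.  It also assumes decimal megabytes and assumes that the "~2GB" running total includes the optional step 3; the proposal states neither explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Approximate inputs, taken from the quoted proposal.
constexpr std::uint64_t kLineTableEntries = 7000000; // ~7M line-table entries
constexpr std::uint64_t kDebugLocEntries = 3500000;  // ~3.5M side-vector entries

// Assumption: decimal megabytes, rounded down.
constexpr std::uint64_t toMB(std::uint64_t Bytes) { return Bytes / 1000000; }

// Step 2: 104 bytes saved per line-table entry.
constexpr std::uint64_t lineTableSavingsMB() {
  return toMB(kLineTableEntries * 104);
}

// Step 3: current cost of the DebugLoc side-vectors at ~180 bytes per entry.
constexpr std::uint64_t debugLocCurrentMB() {
  return toMB(kDebugLocEntries * 180);
}

// Step 3: overhead after the rewrite at ~12 bytes per line-table entry.
constexpr std::uint64_t debugLocProposedMB() {
  return toMB(kLineTableEntries * 12);
}

// Sum of the rounded savings quoted for steps 2, 3, and 4.
// Assumption: the optional step 3 is counted in the running total.
constexpr std::uint64_t quotedRunningTotalMB() { return 700 + 520 + 960; }

// Difference between two costs; the assert guards against unsigned wraparound.
inline std::uint64_t savingsMB(std::uint64_t CurrentMB,
                               std::uint64_t ProposedMB) {
  assert(CurrentMB >= ProposedMB && "proposed cost exceeds current cost");
  return CurrentMB - ProposedMB;
}

static_assert(lineTableSavingsMB() == 728, "quoted as ~700MB");
static_assert(debugLocCurrentMB() == 630, "quoted as ~600MB");
static_assert(debugLocProposedMB() == 84, "quoted as ~80MB");
static_assert(quotedRunningTotalMB() == 2180, "quoted as ~2GB");
```

With the unrounded products, `savingsMB(debugLocCurrentMB(), debugLocProposedMB())` gives 546; the quoted 520MB comes from subtracting the already-rounded ~600MB and ~80MB figures.  Either way the totals land near the quoted ~2GB.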