Sean Silva
2014-Oct-14 01:59 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
For those interested, I've attached some pie charts based on Duncan's data
in one of the other posts; successive slides break down the usage
increasingly finely. To my understanding, they represent the number of
Value's (and subclasses) allocated.

On Mon, Oct 13, 2014 at 3:02 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:

> In r219010, I merged integer and string fields into a single header
> field. By reducing the number of metadata operands used in debug info,
> this saved 2.2GB on an `llvm-lto` bootstrap. I've done some profiling
> of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
> I've concluded that they will be insufficient.
>
> Instead, I'd like to implement a more aggressive plan, which as a
> side-effect cleans up the much "loved" debug info IR assembly syntax.
>
> At a high-level, the idea is to create distinct subclasses of `Value`
> for each debug info concept, starting with line table entries and moving
> on to the DIDescriptor hierarchy. By leveraging the use-list
> infrastructure for metadata operands -- i.e., only using value handles
> for non-metadata operands -- we'll improve memory usage and increase
> RAUW speed.
>
> My rough plan follows. I quote some numbers for memory savings below
> based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
> on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
> -save-temps option) that currently peaks at 15.3GB.
>

Stupid question, but when I was working on LTO last Summer the primary
culprit for excessive memory use was due to us not being smart when linking
the IR together (Espindola would know more details). Do we still have that
problem? For starters, how does the memory usage of just llvm-link compare
to the memory usage of the actual LTO run? If the issue I was seeing last
Summer is still there, you should see that the invocation of llvm-link is
actually the most memory-intensive part of the LTO step, by far.

Also, you seem to really like saying "peak" here. Is there a definite peak?
When does it occur?

>
> 1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
>    must all be metadata. The cost per operand is 1 pointer, vs. 4
>    pointers in an `MDNode`.
>
> 2. Create `MDLineTable` as the first subclass of `MDUser`. Use normal
>    fields (not `Value`s) for the line and column, and use `Use`
>    operands for the metadata operands.
>
>    On x86-64, this will save 104B / line table entry. Linking
>    `llvm-lto` uses ~7M line-table entries, so this on its own saves
>    ~700MB.
>
>    Sketch of class definition:
>
>        class MDLineTable : public MDUser {
>          unsigned Line;
>          unsigned Column;
>        public:
>          static MDLineTable *get(unsigned Line, unsigned Column,
>                                  MDNode *Scope);
>          static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
>          static MDLineTable *getBase(MDLineTable *Inlined);
>
>          unsigned getLine() const { return Line; }
>          unsigned getColumn() const { return Column; }
>          bool isInlined() const { return getNumOperands() == 2; }
>          MDNode *getScope() const { return getOperand(0); }
>          MDNode *getInlinedAt() const { return getOperand(1); }
>        };
>
>    Proposed assembly syntax:
>
>        ; Not inlined.
>        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)
>
>        ; Inlined.
>        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
>                                   inlinedAt: metadata !10)
>
>        ; Column defaulted to 0.
>        !7 = metadata !MDLineTable(line: 45, scope: metadata !9)
>
>    (What colour should that bike shed be?)
>
> 3. (Optional) Rewrite `DebugLoc` lookup tables. My profiling shows
>    that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
>    table entries. The cost of these is ~180B each, for another
>    ~600MB.
>
>    If we integrate a side-table of `MDLineTable`s into its uniquing,
>    the overhead is only ~12B / line table entry, or ~80MB. This saves
>    520MB.
>
>    This is somewhat perpendicular to redesigning the metadata format,
>    but IMO it's worth doing as soon as it's possible.
>
> 4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
>    through an intermediate class `DebugMDNode` with an
>    allocation-time-optional `CallbackVH` available for referencing
>    non-metadata. Change `DIDescriptor` to wrap a `DebugMDNode` instead
>    of an `MDNode`.
>
>    This saves another ~960MB, for a running total of ~2GB.
>

2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we have a
single pie slice near 40% of the # of Value's allocated and another at 21%.
Especially this being "step 4".

As a rough back of the envelope calculation, dividing 15.3GB by ~24 million
Values gives about 600 bytes per Value. That seems sort of excessive (but is
it realistic?). All of the data types that you are proposing to shrink fall
far short of this "average size", meaning that if you are trying to reduce
memory usage, you might be looking in the wrong place. Something smells
fishy. At the very least, this would indicate that the real memory usage is
elsewhere.

A pie chart breaking down the total memory usage seems essential to have
here.

>
>    Proposed assembly syntax:
>
>        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
>                                          fields: "0\00clang 3.6\00...",
>                                          operands: { metadata !8, ... })
>
>        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
>                                          fields: "global_var\00...",
>                                          operands: { metadata !8, ... },
>                                          handle: i32* @global_var)
>
>    This syntax pulls the tag out of the current header-string, calls
>    the rest of the header "fields", and includes the metadata operands
>    in "operands".
>
> 5. Incrementally create subclasses of `DebugMDNode`, such as
>    `MDCompileUnit` and `MDSubprogram`. Sub-classed nodes replace the
>    "fields" and "operands" catch-alls with explicit names for each
>    operand.
>
>    Proposed assembly syntax:
>
>        !7 = metadata !MDSubprogram(line: 45, name: "foo", displayName: "foo",
>                                    linkageName: "_Z3foov", file: metadata !8,
>                                    function: i32 (i32)* @foo)
>
> 6. Remove the dead code for `GenericDebugMDNode`.
>
> 7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW
>    traffic during bitcode serialization. Now that metadata types are
>    known, we can write debug info out in an order that makes it cheap
>    to read back in.
>
>    Note that using `MDUser` will make RAUW much cheaper, since we're
>    using the use-list infrastructure for most of them. If RAUW isn't
>    showing up in a profile, I may skip this.
>
> Does this direction seem reasonable? Any major problems I've missed?
>

You need more data. Right now you have essentially one data point, and it's
not even clear what you measured really. If your goal is saving memory, I
would expect at least a pie chart that breaks down LLVM's memory usage (not
just # of allocations of different sorts; an approximation is fine, as long
as you explain how you arrived at it and in what sense it approximates the
true number).

Do the numbers change significantly for different projects? (e.g. Chromium
or Firefox or a kernel or a large app you have handy to compile with LTO?).
If you have specific data you want (and a suggestion for how to gather it),
I can also get your numbers for one of our internal games as well.

Once you have some more data, then as a first step, I would like to see an
analysis of how much we can "ideally" expect to gain (back of the envelope
calculations == win).

-- Sean Silva

>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: DebugInfoSize.pdf
Type: application/pdf
Size: 108040 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141013/b1da4b87/attachment.pdf>
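For readers who want to re-check the back-of-the-envelope numbers traded in the message above, here is a small C++ sketch that redoes the arithmetic from the figures quoted in the thread. The constant and function names are made up for illustration, GB and MB are treated as powers of ten, and the last function is only a guess at where the 104B figure might come from (an old four-operand node at four pointers per operand versus two `unsigned` fields plus two one-pointer operands); the thread itself does not spell that out.

```cpp
#include <cstdint>

// Figures as quoted in the thread (all approximate there as well).
constexpr std::uint64_t kLineTableEntries   = 7000000ULL;      // ~7M entries
constexpr std::uint64_t kSavedPerEntryBytes = 104ULL;          // step 2, x86-64
constexpr std::uint64_t kDebugLocEntries    = 3500000ULL;      // side-vector entries
constexpr std::uint64_t kDebugLocEntryBytes = 180ULL;          // ~180B each
constexpr std::uint64_t kSideTableBytes     = 12ULL;           // ~12B per line entry
constexpr std::uint64_t kStep4SavingsBytes  = 960000000ULL;    // ~960MB
constexpr std::uint64_t kPeakBytes          = 15300000000ULL;  // 15.3GB peak
constexpr std::uint64_t kValuesAllocated    = 24000000ULL;     // ~24M Values

// Step 2: per-entry saving times the number of line-table entries.
constexpr std::uint64_t step2SavingsBytes() {
  return kLineTableEntries * kSavedPerEntryBytes;
}

// Step 3: current DebugLoc side-vector cost minus the proposed side-table cost.
constexpr std::uint64_t step3SavingsBytes() {
  return kDebugLocEntries * kDebugLocEntryBytes -
         kLineTableEntries * kSideTableBytes;
}

// Running total after step 4.
constexpr std::uint64_t runningTotalBytes() {
  return step2SavingsBytes() + step3SavingsBytes() + kStep4SavingsBytes;
}

// Sean's average: peak memory divided by the number of Values.
constexpr std::uint64_t averageBytesPerValue() {
  return kPeakBytes / kValuesAllocated;
}

// ASSUMPTION (not stated in the thread): one way to arrive at 104B on a
// 64-bit target. Old node: 4 operands x 4 pointers x 8B. New node:
// 2 unsigned fields x 4B plus 2 operands x 1 pointer x 8B.
constexpr std::uint64_t guessedPerEntrySavingBytes() {
  return (4 * 4 * 8) - (2 * 4 + 2 * 1 * 8);
}
```

Unrounded, these come out to 728MB for step 2, 546MB for step 3 and about 2.2GB in total, which is in line with the rounded ~700MB, 520MB and ~2GB in the proposal; the average works out to 637 bytes per Value, matching the "about 600" above.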
Rafael Espíndola
2014-Oct-14 20:17 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
>> Stupid question, but when I was working on LTO last Summer the primary
>> culprit for excessive memory use was due to us not being smart when linking
>> the IR together (Espindola would know more details). Do we still have that
>> problem? For starters, how does the memory usage of just llvm-link compare
>> to the memory usage of the actual LTO run? If the issue I was seeing last
>> Summer is still there, you should see that the invocation of llvm-link is
>> actually the most memory-intensive part of the LTO step, by far.
>>
>
> This is vague. Could you be more specific on where you saw all of the memory?

I think Sean is referring to the old problem of nodes not being merged
because of cycles. It has been fixed by breaking the cycles by having
some of the edges be represented with stable mangled names.

The problem that Duncan is trying to solve is that the debug info is
still very large, even with the duplicate information removed.

Cheers,
Rafael
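To make the point about cycles and names concrete, here is a toy C++ model, not LLVM code. `ToyNode`, `edge`, `emitModule` and `linkedNodeCount` are all hypothetical names, and module-local pointer identity is only emulated by tagging an edge with its module number. The sketch shows why two mutually-referencing nodes from different modules cannot be uniqued when each edge is module-specific, but collapse once the edges are plain names that are the same everywhere.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Toy stand-in for a debug-info node (hypothetical; not LLVM's MDNode).
struct ToyNode {
  std::string Tag;                 // e.g. "class" or "member"
  std::string Name;                // identifier of this node
  std::vector<std::string> Edges;  // outgoing references
  bool operator<(const ToyNode &O) const {
    return std::tie(Tag, Name, Edges) < std::tie(O.Tag, O.Name, O.Edges);
  }
};

// How an edge to `Target` is recorded inside module `ModuleId`.
// ByName == false emulates pointer identity: the edge only makes sense
// inside its own module, so it differs from module to module.
std::string edge(bool ByName, int ModuleId, const std::string &Target) {
  return ByName ? Target : Target + "@module" + std::to_string(ModuleId);
}

// The two mutually-referencing nodes one module would hold for
//   struct Foo { Foo *next; };
std::vector<ToyNode> emitModule(bool ByName, int ModuleId) {
  return {
      ToyNode{"class", "_ZTS3Foo", {edge(ByName, ModuleId, "next")}},
      ToyNode{"member", "next", {edge(ByName, ModuleId, "_ZTS3Foo")}},
  };
}

// "Link" NumModules copies and count the nodes left after uniquing
// structurally identical ones.
std::size_t linkedNodeCount(bool ByName, int NumModules) {
  std::set<ToyNode> Unique;
  for (int M = 0; M < NumModules; ++M)
    for (const ToyNode &N : emitModule(ByName, M))
      Unique.insert(N);
  return Unique.size();
}
```

With three modules, `linkedNodeCount(false, 3)` keeps all six nodes, while `linkedNodeCount(true, 3)` keeps two. That is the duplication that was removed; what is left after merging is the size problem the RFC is about.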
Sean Silva
2014-Oct-15 21:30 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
On Mon, Oct 13, 2014 at 7:01 PM, Eric Christopher <echristo at gmail.com> wrote:

> On Mon, Oct 13, 2014 at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote:
> > For those interested, I've attached some pie charts based on Duncan's data
> > in one of the other posts; successive slides break down the usage
> > increasingly finely. To my understanding, they represent the number of
> > Value's (and subclasses) allocated.
> >
> > On Mon, Oct 13, 2014 at 3:02 PM, Duncan P. N. Exon Smith
> > <dexonsmith at apple.com> wrote:
> >>
> >> In r219010, I merged integer and string fields into a single header
> >> field. By reducing the number of metadata operands used in debug info,
> >> this saved 2.2GB on an `llvm-lto` bootstrap. I've done some profiling
> >> of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
> >> I've concluded that they will be insufficient.
> >>
> >> Instead, I'd like to implement a more aggressive plan, which as a
> >> side-effect cleans up the much "loved" debug info IR assembly syntax.
> >>
> >> At a high-level, the idea is to create distinct subclasses of `Value`
> >> for each debug info concept, starting with line table entries and moving
> >> on to the DIDescriptor hierarchy. By leveraging the use-list
> >> infrastructure for metadata operands -- i.e., only using value handles
> >> for non-metadata operands -- we'll improve memory usage and increase
> >> RAUW speed.
> >>
> >> My rough plan follows. I quote some numbers for memory savings below
> >> based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
> >> on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
> >> -save-temps option) that currently peaks at 15.3GB.
> >
> >
> > Stupid question, but when I was working on LTO last Summer the primary
> > culprit for excessive memory use was due to us not being smart when linking
> > the IR together (Espindola would know more details). Do we still have that
> > problem? For starters, how does the memory usage of just llvm-link compare
> > to the memory usage of the actual LTO run? If the issue I was seeing last
> > Summer is still there, you should see that the invocation of llvm-link is
> > actually the most memory-intensive part of the LTO step, by far.
> >
>
> This is vague. Could you be more specific on where you saw all of the
> memory?
>

Running `llvm-link *.bc` would OOM a machine with 64GB of RAM (with -g;
without -g it completed with much less). The increase could easily be
watched in real time on the system "process monitor".

-- Sean Silva

>
> -eric
>
> >
> > Also, you seem to really like saying "peak" here. Is there a definite peak?
> > When does it occur?
> >
> >>
> >> 1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
> >>    must all be metadata. The cost per operand is 1 pointer, vs. 4
> >>    pointers in an `MDNode`.
> >>
> >> 2. Create `MDLineTable` as the first subclass of `MDUser`. Use normal
> >>    fields (not `Value`s) for the line and column, and use `Use`
> >>    operands for the metadata operands.
> >>
> >>    On x86-64, this will save 104B / line table entry. Linking
> >>    `llvm-lto` uses ~7M line-table entries, so this on its own saves
> >>    ~700MB.
> >>
> >>    Sketch of class definition:
> >>
> >>        class MDLineTable : public MDUser {
> >>          unsigned Line;
> >>          unsigned Column;
> >>        public:
> >>          static MDLineTable *get(unsigned Line, unsigned Column,
> >>                                  MDNode *Scope);
> >>          static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
> >>          static MDLineTable *getBase(MDLineTable *Inlined);
> >>
> >>          unsigned getLine() const { return Line; }
> >>          unsigned getColumn() const { return Column; }
> >>          bool isInlined() const { return getNumOperands() == 2; }
> >>          MDNode *getScope() const { return getOperand(0); }
> >>          MDNode *getInlinedAt() const { return getOperand(1); }
> >>        };
> >>
> >>    Proposed assembly syntax:
> >>
> >>        ; Not inlined.
> >>        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)
> >>
> >>        ; Inlined.
> >>        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
> >>                                   inlinedAt: metadata !10)
> >>
> >>        ; Column defaulted to 0.
> >>        !7 = metadata !MDLineTable(line: 45, scope: metadata !9)
> >>
> >>    (What colour should that bike shed be?)
> >>
> >> 3. (Optional) Rewrite `DebugLoc` lookup tables. My profiling shows
> >>    that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
> >>    table entries. The cost of these is ~180B each, for another
> >>    ~600MB.
> >>
> >>    If we integrate a side-table of `MDLineTable`s into its uniquing,
> >>    the overhead is only ~12B / line table entry, or ~80MB. This saves
> >>    520MB.
> >>
> >>    This is somewhat perpendicular to redesigning the metadata format,
> >>    but IMO it's worth doing as soon as it's possible.
> >>
> >> 4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
> >>    through an intermediate class `DebugMDNode` with an
> >>    allocation-time-optional `CallbackVH` available for referencing
> >>    non-metadata. Change `DIDescriptor` to wrap a `DebugMDNode` instead
> >>    of an `MDNode`.
> >>
> >>    This saves another ~960MB, for a running total of ~2GB.
> >
> >
> > 2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we have a
> > single pie slice near 40% of the # of Value's allocated and another at 21%.
> > Especially this being "step 4".
> >
> > As a rough back of the envelope calculation, dividing 15.3GB by ~24 million
> > Values gives about 600 bytes per Value. That seems sort of excessive (but is
> > it realistic?). All of the data types that you are proposing to shrink fall
> > far short of this "average size", meaning that if you are trying to reduce
> > memory usage, you might be looking in the wrong place. Something smells
> > fishy. At the very least, this would indicate that the real memory usage is
> > elsewhere.
> >
> > A pie chart breaking down the total memory usage seems essential to have
> > here.
> >
> >>
> >>    Proposed assembly syntax:
> >>
> >>        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
> >>                                          fields: "0\00clang 3.6\00...",
> >>                                          operands: { metadata !8, ... })
> >>
> >>        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
> >>                                          fields: "global_var\00...",
> >>                                          operands: { metadata !8, ... },
> >>                                          handle: i32* @global_var)
> >>
> >>    This syntax pulls the tag out of the current header-string, calls
> >>    the rest of the header "fields", and includes the metadata operands
> >>    in "operands".
> >>
> >> 5. Incrementally create subclasses of `DebugMDNode`, such as
> >>    `MDCompileUnit` and `MDSubprogram`. Sub-classed nodes replace the
> >>    "fields" and "operands" catch-alls with explicit names for each
> >>    operand.
> >>
> >>    Proposed assembly syntax:
> >>
> >>        !7 = metadata !MDSubprogram(line: 45, name: "foo", displayName: "foo",
> >>                                    linkageName: "_Z3foov", file: metadata !8,
> >>                                    function: i32 (i32)* @foo)
> >>
> >> 6. Remove the dead code for `GenericDebugMDNode`.
> >>
> >> 7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW
> >>    traffic during bitcode serialization. Now that metadata types are
> >>    known, we can write debug info out in an order that makes it cheap
> >>    to read back in.
> >>
> >>    Note that using `MDUser` will make RAUW much cheaper, since we're
> >>    using the use-list infrastructure for most of them. If RAUW isn't
> >>    showing up in a profile, I may skip this.
> >>
> >> Does this direction seem reasonable? Any major problems I've missed?
> >
> >
> > You need more data. Right now you have essentially one data point, and it's
> > not even clear what you measured really. If your goal is saving memory, I
> > would expect at least a pie chart that breaks down LLVM's memory usage (not
> > just # of allocations of different sorts; an approximation is fine, as long
> > as you explain how you arrived at it and in what sense it approximates the
> > true number).
> >
> > Do the numbers change significantly for different projects? (e.g. Chromium
> > or Firefox or a kernel or a large app you have handy to compile with LTO?).
> > If you have specific data you want (and a suggestion for how to gather it),
> > I can also get your numbers for one of our internal games as well.
> >
> > Once you have some more data, then as a first step, I would like to see an
> > analysis of how much we can "ideally" expect to gain (back of the envelope
> > calculations == win).
> >
> > -- Sean Silva
> >
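On the question of whether there is a definite peak, one way to get a number instead of watching a process monitor is to ask the OS for the process's peak resident set size. The following is a minimal POSIX-only sketch; `peakRssBytes` is a made-up helper name, and nothing in the thread says any of the participants measured it this way. `getrusage` reports `ru_maxrss` in kilobytes on Linux and in bytes on macOS, which the sketch accounts for.

```cpp
#include <sys/resource.h>

#include <cstdint>

// Peak resident set size of the calling process, in bytes.
// Returns 0 if the query fails. POSIX only (Linux, macOS, BSDs).
std::uint64_t peakRssBytes() {
  struct rusage Usage = {};
  if (getrusage(RUSAGE_SELF, &Usage) != 0)
    return 0;
#if defined(__APPLE__)
  // macOS reports ru_maxrss in bytes.
  return static_cast<std::uint64_t>(Usage.ru_maxrss);
#else
  // Linux reports ru_maxrss in kilobytes.
  return static_cast<std::uint64_t>(Usage.ru_maxrss) * 1024;
#endif
}
```

Printing this value at a few points in a tool (say, after linking and again at exit) would show whether the high-water mark is reached while linking the IR or later. It says when the peak has been reached, not what is using the memory, so it complements a per-type breakdown and does not replace one.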