Sean Silva
2014-Oct-14 01:59 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
For those interested, I've attached some pie charts based on Duncan's data
in one of the other posts; successive slides break down the usage
increasingly finely. To my understanding, they represent the number of
`Value`s (and subclasses) allocated.

On Mon, Oct 13, 2014 at 3:02 PM, Duncan P. N. Exon Smith
<dexonsmith at apple.com> wrote:

> In r219010, I merged integer and string fields into a single header
> field.  By reducing the number of metadata operands used in debug info,
> this saved 2.2GB on an `llvm-lto` bootstrap.  I've done some profiling
> of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
> I've concluded that they will be insufficient.
>
> Instead, I'd like to implement a more aggressive plan, which as a
> side-effect cleans up the much "loved" debug info IR assembly syntax.
>
> At a high-level, the idea is to create distinct subclasses of `Value`
> for each debug info concept, starting with line table entries and moving
> on to the DIDescriptor hierarchy.  By leveraging the use-list
> infrastructure for metadata operands -- i.e., only using value handles
> for non-metadata operands -- we'll improve memory usage and increase
> RAUW speed.
>
> My rough plan follows.  I quote some numbers for memory savings below
> based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
> on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
> -save-temps option) that currently peaks at 15.3GB.

Stupid question, but when I was working on LTO last Summer the primary
culprit for excessive memory use was due to us not being smart when linking
the IR together (Espindola would know more details). Do we still have that
problem? For starters, how does the memory usage of just llvm-link compare
to the memory usage of the actual LTO run? If the issue I was seeing last
Summer is still there, you should see that the invocation of llvm-link is
actually the most memory-intensive part of the LTO step, by far.

Also, you seem to really like saying "peak" here. Is there a definite peak?
When does it occur?

> 1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
>    must all be metadata.  The cost per operand is 1 pointer, vs. 4
>    pointers in an `MDNode`.
>
> 2. Create `MDLineTable` as the first subclass of `MDUser`.  Use normal
>    fields (not `Value`s) for the line and column, and use `Use`
>    operands for the metadata operands.
>
>    On x86-64, this will save 104B / line table entry.  Linking
>    `llvm-lto` uses ~7M line-table entries, so this on its own saves
>    ~700MB.
>
>    Sketch of class definition:
>
>        class MDLineTable : public MDUser {
>          unsigned Line;
>          unsigned Column;
>        public:
>          static MDLineTable *get(unsigned Line, unsigned Column,
>                                  MDNode *Scope);
>          static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
>          static MDLineTable *getBase(MDLineTable *Inlined);
>
>          unsigned getLine() const { return Line; }
>          unsigned getColumn() const { return Column; }
>          bool isInlined() const { return getNumOperands() == 2; }
>          MDNode *getScope() const { return getOperand(0); }
>          MDNode *getInlinedAt() const { return getOperand(1); }
>        };
>
>    Proposed assembly syntax:
>
>        ; Not inlined.
>        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)
>
>        ; Inlined.
>        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
>                                   inlinedAt: metadata !10)
>
>        ; Column defaulted to 0.
>        !7 = metadata !MDLineTable(line: 45, scope: metadata !9)
>
>    (What colour should that bike shed be?)
>
> 3. (Optional) Rewrite `DebugLoc` lookup tables.  My profiling shows
>    that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
>    table entries.  The cost of these is ~180B each, for another
>    ~600MB.
>
>    If we integrate a side-table of `MDLineTable`s into its uniquing,
>    the overhead is only ~12B / line table entry, or ~80MB.  This saves
>    520MB.
>
>    This is somewhat perpendicular to redesigning the metadata format,
>    but IMO it's worth doing as soon as it's possible.
>
> 4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
>    through an intermediate class `DebugMDNode` with an
>    allocation-time-optional `CallbackVH` available for referencing
>    non-metadata.  Change `DIDescriptor` to wrap a `DebugMDNode` instead
>    of an `MDNode`.
>
>    This saves another ~960MB, for a running total of ~2GB.

2GB (out of 15.3GB, i.e. ~13%) seems pretty pathetic savings when we have a
single pie slice near 40% of the # of `Value`s allocated and another at 21%.
Especially this being "step 4".

As a rough back-of-the-envelope calculation, dividing 15.3GB by ~24 million
Values gives about 600 bytes per Value. That seems sort of excessive (but is
it realistic?). All of the data types that you are proposing to shrink fall
far short of this "average size", meaning that if you are trying to reduce
memory usage, you might be looking in the wrong place. Something smells
fishy. At the very least, this would indicate that the real memory usage is
elsewhere.

A pie chart breaking down the total memory usage seems essential to have
here.

>    Proposed assembly syntax:
>
>        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
>                                          fields: "0\00clang 3.6\00...",
>                                          operands: { metadata !8, ... })
>
>        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
>                                          fields: "global_var\00...",
>                                          operands: { metadata !8, ... },
>                                          handle: i32* @global_var)
>
>    This syntax pulls the tag out of the current header-string, calls
>    the rest of the header "fields", and includes the metadata operands
>    in "operands".
>
> 5. Incrementally create subclasses of `DebugMDNode`, such as
>    `MDCompileUnit` and `MDSubprogram`.  Sub-classed nodes replace the
>    "fields" and "operands" catch-alls with explicit names for each
>    operand.
>
>    Proposed assembly syntax:
>
>        !7 = metadata !MDSubprogram(line: 45, name: "foo", displayName: "foo",
>                                    linkageName: "_Z3foov", file: metadata !8,
>                                    function: i32 (i32)* @foo)
>
> 6. Remove the dead code for `GenericDebugMDNode`.
>
> 7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW
>    traffic during bitcode serialization.  Now that metadata types are
>    known, we can write debug info out in an order that makes it cheap
>    to read back in.
>
>    Note that using `MDUser` will make RAUW much cheaper, since we're
>    using the use-list infrastructure for most of them.  If RAUW isn't
>    showing up in a profile, I may skip this.
>
> Does this direction seem reasonable?  Any major problems I've missed?

You need more data. Right now you have essentially one data point, and it's
not even clear what you measured really. If your goal is saving memory, I
would expect at least a pie chart that breaks down LLVM's memory usage (not
just # of allocations of different sorts; an approximation is fine, as long
as you explain how you arrived at it and in what sense it approximates the
true number).

Do the numbers change significantly for different projects? (e.g. Chromium
or Firefox or a kernel or a large app you have handy to compile with LTO?)
If you have specific data you want (and a suggestion for how to gather it),
I can also get you numbers for one of our internal games as well.

Once you have some more data, then as a first step, I would like to see an
analysis of how much we can "ideally" expect to gain (back-of-the-envelope
calculations == win).

-- Sean Silva

> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-------------- next part --------------
A non-text attachment was scrubbed...
Name: DebugInfoSize.pdf
Type: application/pdf
Size: 108040 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141013/b1da4b87/attachment.pdf>
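As a quick check on the figures quoted above, the short self-contained
program below just redoes the arithmetic from the thread. The per-entry
byte costs and entry counts all come from the messages themselves; the only
assumption is decimal MB/GB, which matches the quoted "~700MB".

    // Recompute the back-of-the-envelope numbers quoted in the thread.
    #include <cstdio>

    int main() {
      // Step 2: ~104 bytes saved per line-table entry, ~7M entries.
      double Step2MB = 104.0 * 7e6 / 1e6;
      // Step 3: side-vectors cost ~180B * 3.5M; proposed scheme ~12B * 7M.
      double Step3MB = (180.0 * 3.5e6 - 12.0 * 7e6) / 1e6;
      // Sean's average: 15.3GB spread over ~24M Values.
      double BytesPerValue = 15.3e9 / 24e6;

      std::printf("step 2 saves ~%.0f MB (quoted: ~700MB)\n", Step2MB);
      std::printf("step 3 saves ~%.0f MB (quoted: ~520MB)\n", Step3MB);
      std::printf("average ~%.0f bytes per Value (quoted: ~600)\n",
                  BytesPerValue);
      return 0;
    }

Running it gives roughly 728 MB, 546 MB, and 638 bytes per Value, consistent
with the rounded figures in the messages above.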
Rafael Espíndola
2014-Oct-14 20:17 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
>> Stupid question, but when I was working on LTO last Summer the primary
>> culprit for excessive memory use was due to us not being smart when linking
>> the IR together (Espindola would know more details). Do we still have that
>> problem? For starters, how does the memory usage of just llvm-link compare
>> to the memory usage of the actual LTO run? If the issue I was seeing last
>> Summer is still there, you should see that the invocation of llvm-link is
>> actually the most memory-intensive part of the LTO step, by far.
>
> This is vague. Could you be more specific on where you saw all of the memory?

I think Sean is referring to the old problem of nodes not being merged
because of cycles. It has been fixed by breaking the cycles, representing
some of the edges with stable mangled names.

The problem that Duncan is trying to solve is that the debug info is still
very large, even with the duplicate information removed.

Cheers,
Rafael
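To make the cycle-breaking concrete, here is a small made-up example: two
translation units that both see the same class definition. Before the fix
Rafael mentions, each unit's debug-info graph for the type contained cycles
(e.g. through the self-referential member), which defeated uniquing when
llvm-link merged modules. Keying the type by a stable mangled-name
identifier lets the linker recognize the copies as one type. The file
names, the class, and its mangled name are invented for illustration; the
exact metadata encoding is not shown.

    // shared.h -- hypothetical header included by both a.cpp and b.cpp.
    struct Node {
      Node *next;   // self-reference: the kind of edge that forms a cycle
      int value;
    };

    // a.cpp:  #include "shared.h"   ...uses Node...
    // b.cpp:  #include "shared.h"   ...uses Node...
    //
    // Compiled with `clang -flto -g`, each TU emits its own debug-info
    // description of 'Node'.  With the type keyed by a stable identifier
    // (roughly its mangled name, "_ZTS4Node"), edges into it can be plain
    // strings instead of node pointers, so llvm-link can unique the two
    // descriptions instead of keeping two cyclic graphs -- the
    // de-duplication Rafael describes.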
Sean Silva
2014-Oct-15 21:30 UTC
[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
On Mon, Oct 13, 2014 at 7:01 PM, Eric Christopher <echristo at gmail.com> wrote:

> On Mon, Oct 13, 2014 at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote:
> > For those interested, I've attached some pie charts based on Duncan's data
> > in one of the other posts; successive slides break down the usage
> > increasingly finely. To my understanding, they represent the number of
> > `Value`s (and subclasses) allocated.
> >
> > [...]
> >
> > Stupid question, but when I was working on LTO last Summer the primary
> > culprit for excessive memory use was due to us not being smart when linking
> > the IR together (Espindola would know more details). Do we still have that
> > problem? For starters, how does the memory usage of just llvm-link compare
> > to the memory usage of the actual LTO run? If the issue I was seeing last
> > Summer is still there, you should see that the invocation of llvm-link is
> > actually the most memory-intensive part of the LTO step, by far.
>
> This is vague. Could you be more specific on where you saw all of the
> memory?

Running `llvm-link *.bc` would OOM a machine with 64GB of RAM (with -g;
without -g it completed with much less). The growth in memory usage could
easily be watched in the system "process monitor" in real time.

-- Sean Silva

> -eric
>
> [...]
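For anyone who wants harder numbers than watching a process monitor, one
low-tech option (a suggestion on my part, not something anyone in the
thread did) is to print the process's peak resident set size, either by
temporarily adding a small helper like the sketch below to the tool being
measured (`llvm-link` or `llvm-lto`), or by wrapping the run in GNU time.
`getrusage` is POSIX; `ru_maxrss` is reported in kilobytes on Linux and in
bytes on macOS.

    // Minimal sketch: report this process's peak resident set size.
    #include <sys/resource.h>
    #include <cstdio>

    static void reportPeakRSS(const char *Tag) {
      struct rusage RU;
      if (getrusage(RUSAGE_SELF, &RU) == 0)
        std::fprintf(stderr, "%s: peak RSS = %ld (kB on Linux, bytes on macOS)\n",
                     Tag, (long)RU.ru_maxrss);
    }

    // Call reportPeakRSS("llvm-link") just before the tool exits.  On Linux,
    // something like `/usr/bin/time -v llvm-link *.bc -o /dev/null` reports
    // "Maximum resident set size" without touching the source at all.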