thr3ads.net - llvm dev - [LLVMdev] Proposal: type uniquing of debug info for LTO [Jun 2013]

If this information is useful, please help other people find it:
Share via:

David Blaikie

2013-Jun-21 00:18 UTC

[LLVMdev] Proposal: type uniquing of debug info for LTO

On Thu, Jun 20, 2013 at 5:13 PM, Manman Ren <mren at apple.com>
wrote:>
> On Jun 20, 2013, at 4:52 PM, David Blaikie wrote:
>
>> On Thu, Jun 20, 2013 at 4:45 PM, Manman Ren <mren at apple.com>
wrote:
>>>
>>> On Jun 20, 2013, at 3:55 PM, Manman Ren <mren at apple.com>
wrote:
>>>
>>>
>>> On Jun 20, 2013, at 2:58 PM, Eric Christopher wrote:
>>>
>>> Hi Manman,
>>>
>>> On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at
apple.com> wrote:
>>>
>>>
>>> The intent of this proposal is to speedup compilation of
"-flto -g" for c++
>>> programs.
>>> This is based on discussions with Adrian, David and Eric.
>>>
>>>
>>> Thanks for bringing this back to the list. The original thread was
>>> getting quite long.
>>>
>>> ---------------------------
>>> Problem:
>>> A single class can be used in multiple source files and the DI
(Debug Info)
>>> class is included in multiple bc files. The duplication of
>>> class definitions causes blow-up in # of MDNodes, # of DIEs,
leading to
>>> large memory requirement.
>>>
>>> As an example, SPEC xalancbmk requires 7GB of memory when compiled
with
>>> -flto -g.
>>> With a preliminary implementation of type uniquing, the memory
usage will be
>>> down to 2GB.
>>>
>>> In order to unique types, we have to break cycles in the MDNodes.
>>>
>>> A simple struct definition
>>> struct Base {
>>> int a;
>>> };
>>> can cause cycles in MDNodes:
>>> !12 = metadata !{i32 786451, metadata !13, null, metadata
!"Base", i32 1,
>>> i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null,
null} ; [
>>> DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset
0] [from ]
>>> !14 = metadata !{metadata !15, metadata !16}
>>> !15 = metadata !{i32 786445, metadata !13, metadata !12, metadata
!"a", i32
>>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ]
[a] [line
>>> 2, size 32, align 32, offset 0] [from int]
>>> !16 = metadata !{i32 786478, metadata !13, metadata !12, metadata
!"Base",
>>> metadata !"Base", metadata !"", i32 1, metadata
!17, i1 false, i1 false, i32
>>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !20,
i32 1} ;
>>> [ DW_TAG_subprogram ] [line 1] [Base]
>>>
>>> Cycles: !12 -- !14 -- !15 -- !12
>>>          !12 -- !14 -- !16 -- !12
>>>
>>> These cycles make it hard to unique the same struct used in two bc
files.
>>>
>>> ---------------------------
>>> How to fix:
>>>
>>> We attach a hash value to types to help type uniquing and we also
replace
>>> references to types with their hash values.
>>> For the above struct "Base", we now have the following
MDNodes:
>>> !4 = metadata !{i32 786451, metadata !5, null, metadata
!"Base", i32 1, i64
>>> 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null,
i32
>>> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32,
align 32,
>>> offset 0] [from ]
>>> !6 = metadata !{metadata !7, metadata !9}
>>> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata
!"a", i32
>>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ]
[a] [line
>>> 2, size 32, align 32, offset 0] [from int]
>>> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata
!"Base",
>>> metadata !"Base", metadata !"", i32 1, metadata
!10, i1 false, i1 false, i32
>>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13,
i32 1} ;
>>> [ DW_TAG_subprogram ] [line 1] [Base]
>>>
>>> Note that the cycles are gone and !4 has a hash value of 915398439,
and the
>>> references to !4 are replaced with 915398439.
>>> Thanks Eric for suggesting replacing MD reference with a hash
value.
>>>
>>>
>>> In particular I recommended this:
>>>
>>> a) For C++ odr replace it with the "hash" that's just
a string
>>> representing the type name.
>>> b) For Internal C++ types and all C types replace it with a string
>>> that's a concatenation of the type name and the name of the
compile
>>> unit.
>>>
>>> Yes, that is what we agreed on over email.
>>>
>>>
>>>
>>> There are a few issues:
>>> 1> How to generate the hash for a given type?
>>> With C++'s ODR, it should be enough by using the context and
the name for
>>> non-internal c++ types.
>>> For internal c++ types and types of other languages, hash will not
be used.
>>>
>>>
>>> Explain this?
>>>
>>>
>>> For a while, I am going to support both hash and MD reference, once
>>> everything is working with hash,
>>> I will update all debug info testing cases, turn -gtype-uniquing
on, and
>>> remove the other path.
>>>
>>> For internal c++ types, initially they will follow the path of
using MD
>>> references without a hash.
>>>
>>>
>>> My current implementation is to add a few static member functions
in MDNode
>>> to profile DI nodes differently.
>>> + /// If the array of Vals is for debug info, profile it specially
and
>>> return true.
>>> + /// If the DI node has a hash value, generate the profile using
only the
>>> hash value and the declaration flag.
>>> +  static bool profileDebugInfoNode(ArrayRef<Value*> Vals,
FoldingSetNodeID
>>> &ID);
>>>
>>> + /// If the MDNode is for debug info, profile it specially and
return true.
>>> + /// If the DI node has a hash value, generate the profile using
only the
>>> hash value and the declaration flag.
>>> +  static bool profileDebugInfoNode(const MDNode *M,
FoldingSetNodeID &ID);
>>>
>>> + /// Given a hash value and a flag, generate the profile for later
lookup.
>>> +  static bool profileDebugInfoNode(unsigned Hash, bool
Declaration,
>>> FoldingSetNodeID &ID);
>>>
>>> These static functions are called in Metadata.cpp:
>>> void MDNode::Profile(FoldingSetNodeID &ID) const {
>>> +  if (profileDebugInfoNode(this, ID))
>>> +    return;
>>> +
>>>
>>> There are other examples of these in MDNode for handling of
specific
>>> metadata.
>>> /// Methods for metadata merging.
>>> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
>>> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
>>> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
>>>
>>> Comments are welcome on whether this violates any layering rule.
>>>
>>>
>>> As I've said many times in email, I don't think this is a
good idea
>>> and would prefer either a or b below. a) is a much simpler
solution.
>>>
>>> Any reason that why it is not a good idea?
>>>
>>>
>>> Other choices are:
>>> a> Keep a map in DwarfDebug
>>>  Keep in mind that the map is used at many stages, and it has to be
in sync
>>> with MDNodeSet.
>>> b> Generalize MDNode to be aware of hash (David can provide more
details)
>>> c> Extend MDNode to DINode and modify streamers (bitcode
reader|writer, ll
>>> reader|writer) to be aware of DINode
>>>  We can provide DINode::get(…) to create a DINode. DINode can have
its own
>>> Profile function.
>>> Other suggestions are welcome.
>>>
>>>
>>> a or b please.
>>>
>>> Option a> will require a DwarfDebug pointer in every stage of
the compiler,
>>> and passing the map to the DI classes.
>>> A rough estimation is around 100 places.
>>> Is it reasonable to pass a DwarfDebug pointer to DIBuilder and llvm
linker?
>>> Also the map needs to be in sync with MDNodeSet, maybe using
ValueHandle can
>>> solve the problem.
>>>
>>>
>>> What about putting the map in LLVMContextImpl?
>>> It already has a few things specifically for debug info:
>>> std::vector<DebugRecVH> ScopeRecords;
>>> DenseMap<std::pair<MDNode*, MDNode*>, int>
ScopeInlinedAt;
>>> …
>>>
>>> I remember David mentioned it once and I forgot about the
conclusion.
>>
>> I mentioned it only as speculation as to how you were implementing it
>> already (but you were doing the profile-changing stuff).
>>
>> I don't think it should be necessary to have the map (in option
(a))
>> in such a central location as LLVMContext. It should be usable just
>> from DwarfDebug for generation, and DIBuilder can have its own,
>> separate map to do similar things during DI building.
>
> We also need the map during llvm linking since linking will create new
MDNodes.
I don't understand what you mean by this. IR linking shouldn't need to
do anything debug-info-specific, it should just be the normal IR
approach.

The declaration-v-definition resolution can be done at codegen-time.
We'll have to walk all the lists of retained types anyway to build the
map, so we can do declaration-v-definition (keep definitions over
declarations when we see both) at that point.
> So any other opinion on putting it in LLVMContext other than it being
central?
>
> Thanks,
> Manman
>
>>
>>>
>>> Thanks,
>>> Manman
>>>
>>>
>>> More details for option b from David
>>>
>>> < The alternative I have in mind is a more complete version of
what
>>> < you're proposing - a full MD feature, not an MD feature
that's just
>>> < barely enough to support the needs of debug info. What we
could do is
>>> < allow the insertion of these MDHash things you spoke about but
take it
>>> < a step further and have MDNode::getOperand walk through the
hash &
>>> < give the value (in this way, DebugInfo wouldn't have to
change at all
>>> < to handle hashes - if the Metadata APIs are going to be aware
of the
>>> < hashes anyway, they might as well provide this convenience
>>> < functionality) the metadata feature would also have to have
some
>>> < blessed top-level named metadata that would have a list of
hash+MDNode
>>> < to keep those MDNodes alive (so you wouldn't have to stuff
all the
>>> < types in the retained types list - metadata would provide the
full
>>> < support, not just half of it).
>>>
>>>
>>>
>>> Transition from current DI Metadata:
>>> To have a smooth transition, I will add a flag
"-gtype-hashing" for the type
>>> uniquing work and turn it on by default when we are ready.
>>>
>>>
>>> I'd prefer just make the change to have the front end emit the
"hash"
>>> (it's not really a hash, it's just a string).
>>>
>>> Are you saying no transition period? A single patch to have correct
handling
>>> of "hash" and to update all existing testing cases?
>>>
>>>
>>> -----------------------------
>>> Patches:
>>> Expect the following patches:
>>> 1> add flag -gtype-hashing
>>> 2> add hash field to DI types
>>> 3> modify DIBuilder to use hash instead of MD reference
>>> 4> related to issue 3
>>>
>>>
>>> These can all be a single patch since it shouldn't be very
large if we
>>> go with a) above. If we go with b) then the MDNode work should be
done
>>> in isolation first and then the debug info on top of it.
>>>
>>> What is wrong with smaller patches?
>>> My estimation for all the above with a) is about 30K + testing
cases.
>>>
>>>
>>> 5> backend change (in DwarfDebug|CompileUnit) to support types
shared among
>>> compile units
>>>  requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related to
ref_addr
>>>
>>>
>>> #5 can and should be done before the rest of them.
>>>
>>> I prefer to submit patches according to the flow of the compiler,
starting
>>> from the frontend, then IR, then backend.
>>> The testing cases will be added for front end, llvm-link and
backend.
>>> Any reason why #5 should be done first?
>>>
>>>
>>> All changes should be local to debug info classes except patch #4.
>>>
>>>
>>> What's patch #4?
>>>
>>> Patch #4 above: related to issue 3 (changes corresponding to how to
solve
>>> issue #3)
>>>
>>> -Manman
>>>
>>>
>>> -eric
>>>
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>
>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>
>

Manman Ren

2013-Jun-21 00:25 UTC

head link

[LLVMdev] Proposal: type uniquing of debug info for LTO

On Jun 20, 2013, at 5:18 PM, David Blaikie <dblaikie at gmail.com> wrote:
> On Thu, Jun 20, 2013 at 5:13 PM, Manman Ren <mren at apple.com>
wrote:
>> 
>> On Jun 20, 2013, at 4:52 PM, David Blaikie wrote:
>> 
>>> On Thu, Jun 20, 2013 at 4:45 PM, Manman Ren <mren at
apple.com> wrote:
>>>> 
>>>> On Jun 20, 2013, at 3:55 PM, Manman Ren <mren at
apple.com> wrote:
>>>> 
>>>> 
>>>> On Jun 20, 2013, at 2:58 PM, Eric Christopher wrote:
>>>> 
>>>> Hi Manman,
>>>> 
>>>> On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at
apple.com> wrote:
>>>> 
>>>> 
>>>> The intent of this proposal is to speedup compilation of
"-flto -g" for c++
>>>> programs.
>>>> This is based on discussions with Adrian, David and Eric.
>>>> 
>>>> 
>>>> Thanks for bringing this back to the list. The original thread
was
>>>> getting quite long.
>>>> 
>>>> ---------------------------
>>>> Problem:
>>>> A single class can be used in multiple source files and the DI
(Debug Info)
>>>> class is included in multiple bc files. The duplication of
>>>> class definitions causes blow-up in # of MDNodes, # of DIEs,
leading to
>>>> large memory requirement.
>>>> 
>>>> As an example, SPEC xalancbmk requires 7GB of memory when
compiled with
>>>> -flto -g.
>>>> With a preliminary implementation of type uniquing, the memory
usage will be
>>>> down to 2GB.
>>>> 
>>>> In order to unique types, we have to break cycles in the
MDNodes.
>>>> 
>>>> A simple struct definition
>>>> struct Base {
>>>> int a;
>>>> };
>>>> can cause cycles in MDNodes:
>>>> !12 = metadata !{i32 786451, metadata !13, null, metadata
!"Base", i32 1,
>>>> i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null,
null} ; [
>>>> DW_TAG_structure_type ] [Base] [line 1, size 32, align 32,
offset 0] [from ]
>>>> !14 = metadata !{metadata !15, metadata !16}
>>>> !15 = metadata !{i32 786445, metadata !13, metadata !12,
metadata !"a", i32
>>>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member
] [a] [line
>>>> 2, size 32, align 32, offset 0] [from int]
>>>> !16 = metadata !{i32 786478, metadata !13, metadata !12,
metadata !"Base",
>>>> metadata !"Base", metadata !"", i32 1,
metadata !17, i1 false, i1 false, i32
>>>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata
!20, i32 1} ;
>>>> [ DW_TAG_subprogram ] [line 1] [Base]
>>>> 
>>>> Cycles: !12 -- !14 -- !15 -- !12
>>>>         !12 -- !14 -- !16 -- !12
>>>> 
>>>> These cycles make it hard to unique the same struct used in two
bc files.
>>>> 
>>>> ---------------------------
>>>> How to fix:
>>>> 
>>>> We attach a hash value to types to help type uniquing and we
also replace
>>>> references to types with their hash values.
>>>> For the above struct "Base", we now have the
following MDNodes:
>>>> !4 = metadata !{i32 786451, metadata !5, null, metadata
!"Base", i32 1, i64
>>>> 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0,
null, i32
>>>> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32,
align 32,
>>>> offset 0] [from ]
>>>> !6 = metadata !{metadata !7, metadata !9}
>>>> !7 = metadata !{i32 786445, metadata !5, i32 915398439,
metadata !"a", i32
>>>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member
] [a] [line
>>>> 2, size 32, align 32, offset 0] [from int]
>>>> !9 = metadata !{i32 786478, metadata !5, i32 915398439,
metadata !"Base",
>>>> metadata !"Base", metadata !"", i32 1,
metadata !10, i1 false, i1 false, i32
>>>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata
!13, i32 1} ;
>>>> [ DW_TAG_subprogram ] [line 1] [Base]
>>>> 
>>>> Note that the cycles are gone and !4 has a hash value of
915398439, and the
>>>> references to !4 are replaced with 915398439.
>>>> Thanks Eric for suggesting replacing MD reference with a hash
value.
>>>> 
>>>> 
>>>> In particular I recommended this:
>>>> 
>>>> a) For C++ odr replace it with the "hash" that's
just a string
>>>> representing the type name.
>>>> b) For Internal C++ types and all C types replace it with a
string
>>>> that's a concatenation of the type name and the name of the
compile
>>>> unit.
>>>> 
>>>> Yes, that is what we agreed on over email.
>>>> 
>>>> 
>>>> 
>>>> There are a few issues:
>>>> 1> How to generate the hash for a given type?
>>>> With C++'s ODR, it should be enough by using the context
and the name for
>>>> non-internal c++ types.
>>>> For internal c++ types and types of other languages, hash will
not be used.
>>>> 
>>>> 
>>>> Explain this?
>>>> 
>>>> 
>>>> For a while, I am going to support both hash and MD reference,
once
>>>> everything is working with hash,
>>>> I will update all debug info testing cases, turn
-gtype-uniquing on, and
>>>> remove the other path.
>>>> 
>>>> For internal c++ types, initially they will follow the path of
using MD
>>>> references without a hash.
>>>> 
>>>> 
>>>> My current implementation is to add a few static member
functions in MDNode
>>>> to profile DI nodes differently.
>>>> + /// If the array of Vals is for debug info, profile it
specially and
>>>> return true.
>>>> + /// If the DI node has a hash value, generate the profile
using only the
>>>> hash value and the declaration flag.
>>>> +  static bool profileDebugInfoNode(ArrayRef<Value*>
Vals, FoldingSetNodeID
>>>> &ID);
>>>> 
>>>> + /// If the MDNode is for debug info, profile it specially and
return true.
>>>> + /// If the DI node has a hash value, generate the profile
using only the
>>>> hash value and the declaration flag.
>>>> +  static bool profileDebugInfoNode(const MDNode *M,
FoldingSetNodeID &ID);
>>>> 
>>>> + /// Given a hash value and a flag, generate the profile for
later lookup.
>>>> +  static bool profileDebugInfoNode(unsigned Hash, bool
Declaration,
>>>> FoldingSetNodeID &ID);
>>>> 
>>>> These static functions are called in Metadata.cpp:
>>>> void MDNode::Profile(FoldingSetNodeID &ID) const {
>>>> +  if (profileDebugInfoNode(this, ID))
>>>> +    return;
>>>> +
>>>> 
>>>> There are other examples of these in MDNode for handling of
specific
>>>> metadata.
>>>> /// Methods for metadata merging.
>>>> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
>>>> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
>>>> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
>>>> 
>>>> Comments are welcome on whether this violates any layering
rule.
>>>> 
>>>> 
>>>> As I've said many times in email, I don't think this is
a good idea
>>>> and would prefer either a or b below. a) is a much simpler
solution.
>>>> 
>>>> Any reason that why it is not a good idea?
>>>> 
>>>> 
>>>> Other choices are:
>>>> a> Keep a map in DwarfDebug
>>>> Keep in mind that the map is used at many stages, and it has to
be in sync
>>>> with MDNodeSet.
>>>> b> Generalize MDNode to be aware of hash (David can provide
more details)
>>>> c> Extend MDNode to DINode and modify streamers (bitcode
reader|writer, ll
>>>> reader|writer) to be aware of DINode
>>>> We can provide DINode::get(…) to create a DINode. DINode can
have its own
>>>> Profile function.
>>>> Other suggestions are welcome.
>>>> 
>>>> 
>>>> a or b please.
>>>> 
>>>> Option a> will require a DwarfDebug pointer in every stage
of the compiler,
>>>> and passing the map to the DI classes.
>>>> A rough estimation is around 100 places.
>>>> Is it reasonable to pass a DwarfDebug pointer to DIBuilder and
llvm linker?
>>>> Also the map needs to be in sync with MDNodeSet, maybe using
ValueHandle can
>>>> solve the problem.
>>>> 
>>>> 
>>>> What about putting the map in LLVMContextImpl?
>>>> It already has a few things specifically for debug info:
>>>> std::vector<DebugRecVH> ScopeRecords;
>>>> DenseMap<std::pair<MDNode*, MDNode*>, int>
ScopeInlinedAt;
>>>> …
>>>> 
>>>> I remember David mentioned it once and I forgot about the
conclusion.
>>> 
>>> I mentioned it only as speculation as to how you were implementing
it
>>> already (but you were doing the profile-changing stuff).
>>> 
>>> I don't think it should be necessary to have the map (in option
(a))
>>> in such a central location as LLVMContext. It should be usable just
>>> from DwarfDebug for generation, and DIBuilder can have its own,
>>> separate map to do similar things during DI building.
>> 
>> We also need the map during llvm linking since linking will create new
MDNodes.
> 
> I don't understand what you mean by this. IR linking shouldn't need
to
> do anything debug-info-specific, it should just be the normal IR
> approach.
> 
> The declaration-v-definition resolution can be done at codegen-time.
> We'll have to walk all the lists of retained types anyway to build the
> map, so we can do declaration-v-definition (keep definitions over
> declarations when we see both) at that point.
Yes, you are right that at codegen-time, we can generate the map from the lists
of retained types.

But dumping the linked ll file requires the map when outputting comments of the
MDNode :]

Thanks,
Manman
> 
>> So any other opinion on putting it in LLVMContext other than it being
central?
>> 
>> Thanks,
>> Manman
>> 
>>> 
>>>> 
>>>> Thanks,
>>>> Manman
>>>> 
>>>> 
>>>> More details for option b from David
>>>> 
>>>> < The alternative I have in mind is a more complete version
of what
>>>> < you're proposing - a full MD feature, not an MD
feature that's just
>>>> < barely enough to support the needs of debug info. What we
could do is
>>>> < allow the insertion of these MDHash things you spoke about
but take it
>>>> < a step further and have MDNode::getOperand walk through
the hash &
>>>> < give the value (in this way, DebugInfo wouldn't have
to change at all
>>>> < to handle hashes - if the Metadata APIs are going to be
aware of the
>>>> < hashes anyway, they might as well provide this convenience
>>>> < functionality) the metadata feature would also have to
have some
>>>> < blessed top-level named metadata that would have a list of
hash+MDNode
>>>> < to keep those MDNodes alive (so you wouldn't have to
stuff all the
>>>> < types in the retained types list - metadata would provide
the full
>>>> < support, not just half of it).
>>>> 
>>>> 
>>>> 
>>>> Transition from current DI Metadata:
>>>> To have a smooth transition, I will add a flag
"-gtype-hashing" for the type
>>>> uniquing work and turn it on by default when we are ready.
>>>> 
>>>> 
>>>> I'd prefer just make the change to have the front end emit
the "hash"
>>>> (it's not really a hash, it's just a string).
>>>> 
>>>> Are you saying no transition period? A single patch to have
correct handling
>>>> of "hash" and to update all existing testing cases?
>>>> 
>>>> 
>>>> -----------------------------
>>>> Patches:
>>>> Expect the following patches:
>>>> 1> add flag -gtype-hashing
>>>> 2> add hash field to DI types
>>>> 3> modify DIBuilder to use hash instead of MD reference
>>>> 4> related to issue 3
>>>> 
>>>> 
>>>> These can all be a single patch since it shouldn't be very
large if we
>>>> go with a) above. If we go with b) then the MDNode work should
be done
>>>> in isolation first and then the debug info on top of it.
>>>> 
>>>> What is wrong with smaller patches?
>>>> My estimation for all the above with a) is about 30K + testing
cases.
>>>> 
>>>> 
>>>> 5> backend change (in DwarfDebug|CompileUnit) to support
types shared among
>>>> compile units
>>>> requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related
to ref_addr
>>>> 
>>>> 
>>>> #5 can and should be done before the rest of them.
>>>> 
>>>> I prefer to submit patches according to the flow of the
compiler, starting
>>>> from the frontend, then IR, then backend.
>>>> The testing cases will be added for front end, llvm-link and
backend.
>>>> Any reason why #5 should be done first?
>>>> 
>>>> 
>>>> All changes should be local to debug info classes except patch
#4.
>>>> 
>>>> 
>>>> What's patch #4?
>>>> 
>>>> Patch #4 above: related to issue 3 (changes corresponding to
how to solve
>>>> issue #3)
>>>> 
>>>> -Manman
>>>> 
>>>> 
>>>> -eric
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130620/ad3905d1/attachment.html>

David Blaikie

2013-Jun-21 00:39 UTC

head link

[LLVMdev] Proposal: type uniquing of debug info for LTO

On Thu, Jun 20, 2013 at 5:25 PM, Manman Ren <mren at apple.com>
wrote:>
> On Jun 20, 2013, at 5:18 PM, David Blaikie <dblaikie at gmail.com>
wrote:
>
> On Thu, Jun 20, 2013 at 5:13 PM, Manman Ren <mren at apple.com>
wrote:
>
>
> On Jun 20, 2013, at 4:52 PM, David Blaikie wrote:
>
> On Thu, Jun 20, 2013 at 4:45 PM, Manman Ren <mren at apple.com>
wrote:
>
>
> On Jun 20, 2013, at 3:55 PM, Manman Ren <mren at apple.com> wrote:
>
>
> On Jun 20, 2013, at 2:58 PM, Eric Christopher wrote:
>
> Hi Manman,
>
> On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at apple.com>
wrote:
>
>
> The intent of this proposal is to speedup compilation of "-flto
-g" for c++
> programs.
> This is based on discussions with Adrian, David and Eric.
>
>
> Thanks for bringing this back to the list. The original thread was
> getting quite long.
>
> ---------------------------
> Problem:
> A single class can be used in multiple source files and the DI (Debug Info)
> class is included in multiple bc files. The duplication of
> class definitions causes blow-up in # of MDNodes, # of DIEs, leading to
> large memory requirement.
>
> As an example, SPEC xalancbmk requires 7GB of memory when compiled with
> -flto -g.
> With a preliminary implementation of type uniquing, the memory usage will
be
> down to 2GB.
>
> In order to unique types, we have to break cycles in the MDNodes.
>
> A simple struct definition
> struct Base {
> int a;
> };
> can cause cycles in MDNodes:
> !12 = metadata !{i32 786451, metadata !13, null, metadata
!"Base", i32 1,
> i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null, null} ; [
> DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0] [from
]
> !14 = metadata !{metadata !15, metadata !16}
> !15 = metadata !{i32 786445, metadata !13, metadata !12, metadata
!"a", i32
> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line
> 2, size 32, align 32, offset 0] [from int]
> !16 = metadata !{i32 786478, metadata !13, metadata !12, metadata
!"Base",
> metadata !"Base", metadata !"", i32 1, metadata !17, i1
false, i1 false, i32
> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !20, i32 1}
;
> [ DW_TAG_subprogram ] [line 1] [Base]
>
> Cycles: !12 -- !14 -- !15 -- !12
>         !12 -- !14 -- !16 -- !12
>
> These cycles make it hard to unique the same struct used in two bc files.
>
> ---------------------------
> How to fix:
>
> We attach a hash value to types to help type uniquing and we also replace
> references to types with their hash values.
> For the above struct "Base", we now have the following MDNodes:
> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base",
i32 1, i64
> 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32
> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32,
> offset 0] [from ]
> !6 = metadata !{metadata !7, metadata !9}
> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata
!"a", i32
> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line
> 2, size 32, align 32, offset 0] [from int]
> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata
!"Base",
> metadata !"Base", metadata !"", i32 1, metadata !10, i1
false, i1 false, i32
> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13, i32 1}
;
> [ DW_TAG_subprogram ] [line 1] [Base]
>
> Note that the cycles are gone and !4 has a hash value of 915398439, and the
> references to !4 are replaced with 915398439.
> Thanks Eric for suggesting replacing MD reference with a hash value.
>
>
> In particular I recommended this:
>
> a) For C++ odr replace it with the "hash" that's just a
string
> representing the type name.
> b) For Internal C++ types and all C types replace it with a string
> that's a concatenation of the type name and the name of the compile
> unit.
>
> Yes, that is what we agreed on over email.
>
>
>
> There are a few issues:
> 1> How to generate the hash for a given type?
> With C++'s ODR, it should be enough by using the context and the name
for
> non-internal c++ types.
> For internal c++ types and types of other languages, hash will not be used.
>
>
> Explain this?
>
>
> For a while, I am going to support both hash and MD reference, once
> everything is working with hash,
> I will update all debug info testing cases, turn -gtype-uniquing on, and
> remove the other path.
>
> For internal c++ types, initially they will follow the path of using MD
> references without a hash.
>
>
> My current implementation is to add a few static member functions in MDNode
> to profile DI nodes differently.
> + /// If the array of Vals is for debug info, profile it specially and
> return true.
> + /// If the DI node has a hash value, generate the profile using only the
> hash value and the declaration flag.
> +  static bool profileDebugInfoNode(ArrayRef<Value*> Vals,
FoldingSetNodeID
> &ID);
>
> + /// If the MDNode is for debug info, profile it specially and return
true.
> + /// If the DI node has a hash value, generate the profile using only the
> hash value and the declaration flag.
> +  static bool profileDebugInfoNode(const MDNode *M, FoldingSetNodeID
&ID);
>
> + /// Given a hash value and a flag, generate the profile for later lookup.
> +  static bool profileDebugInfoNode(unsigned Hash, bool Declaration,
> FoldingSetNodeID &ID);
>
> These static functions are called in Metadata.cpp:
> void MDNode::Profile(FoldingSetNodeID &ID) const {
> +  if (profileDebugInfoNode(this, ID))
> +    return;
> +
>
> There are other examples of these in MDNode for handling of specific
> metadata.
> /// Methods for metadata merging.
> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
>
> Comments are welcome on whether this violates any layering rule.
>
>
> As I've said many times in email, I don't think this is a good idea
> and would prefer either a or b below. a) is a much simpler solution.
>
> Any reason that why it is not a good idea?
>
>
> Other choices are:
> a> Keep a map in DwarfDebug
> Keep in mind that the map is used at many stages, and it has to be in sync
> with MDNodeSet.
> b> Generalize MDNode to be aware of hash (David can provide more
details)
> c> Extend MDNode to DINode and modify streamers (bitcode reader|writer,
ll
> reader|writer) to be aware of DINode
> We can provide DINode::get(…) to create a DINode. DINode can have its own
> Profile function.
> Other suggestions are welcome.
>
>
> a or b please.
>
> Option a> will require a DwarfDebug pointer in every stage of the
compiler,
> and passing the map to the DI classes.
> A rough estimation is around 100 places.
> Is it reasonable to pass a DwarfDebug pointer to DIBuilder and llvm linker?
> Also the map needs to be in sync with MDNodeSet, maybe using ValueHandle
can
> solve the problem.
>
>
> What about putting the map in LLVMContextImpl?
> It already has a few things specifically for debug info:
> std::vector<DebugRecVH> ScopeRecords;
> DenseMap<std::pair<MDNode*, MDNode*>, int> ScopeInlinedAt;
> …
>
> I remember David mentioned it once and I forgot about the conclusion.
>
>
> I mentioned it only as speculation as to how you were implementing it
> already (but you were doing the profile-changing stuff).
>
> I don't think it should be necessary to have the map (in option (a))
> in such a central location as LLVMContext. It should be usable just
> from DwarfDebug for generation, and DIBuilder can have its own,
> separate map to do similar things during DI building.
>
>
> We also need the map during llvm linking since linking will create new
> MDNodes.
>
>
> I don't understand what you mean by this. IR linking shouldn't need
to
> do anything debug-info-specific, it should just be the normal IR
> approach.
>
> The declaration-v-definition resolution can be done at codegen-time.
> We'll have to walk all the lists of retained types anyway to build the
> map, so we can do declaration-v-definition (keep definitions over
> declarations when we see both) at that point.
>
>
> Yes, you are right that at codegen-time, we can generate the map from the
> lists
> of retained types.
>
> But dumping the linked ll file requires the map when outputting comments of
> the MDNode :]
Depending on which things we print out, but yes, in some cases
(derived types) we do print out the type referenced. I assume the
AsmPrinter can build such a map, then. (in fact, with a few clients
like this, it might be nice to build a bit of an abstraction around it
rather than just using a raw map - something that has a ctor (or I
suppose it could be a factory function) that reads in the right
metadata, walks the compile units, etc, and builds the mapping)
>
> Thanks,
> Manman
>
>
> So any other opinion on putting it in LLVMContext other than it being
> central?
>
> Thanks,
> Manman
>
>
>
> Thanks,
> Manman
>
>
> More details for option b from David
>
> < The alternative I have in mind is a more complete version of what
> < you're proposing - a full MD feature, not an MD feature that's
just
> < barely enough to support the needs of debug info. What we could do is
> < allow the insertion of these MDHash things you spoke about but take it
> < a step further and have MDNode::getOperand walk through the hash &
> < give the value (in this way, DebugInfo wouldn't have to change at
all
> < to handle hashes - if the Metadata APIs are going to be aware of the
> < hashes anyway, they might as well provide this convenience
> < functionality) the metadata feature would also have to have some
> < blessed top-level named metadata that would have a list of hash+MDNode
> < to keep those MDNodes alive (so you wouldn't have to stuff all the
> < types in the retained types list - metadata would provide the full
> < support, not just half of it).
>
>
>
> Transition from current DI Metadata:
> To have a smooth transition, I will add a flag "-gtype-hashing"
for the type
> uniquing work and turn it on by default when we are ready.
>
>
> I'd prefer just make the change to have the front end emit the
"hash"
> (it's not really a hash, it's just a string).
>
> Are you saying no transition period? A single patch to have correct
handling
> of "hash" and to update all existing testing cases?
>
>
> -----------------------------
> Patches:
> Expect the following patches:
> 1> add flag -gtype-hashing
> 2> add hash field to DI types
> 3> modify DIBuilder to use hash instead of MD reference
> 4> related to issue 3
>
>
> These can all be a single patch since it shouldn't be very large if we
> go with a) above. If we go with b) then the MDNode work should be done
> in isolation first and then the debug info on top of it.
>
> What is wrong with smaller patches?
> My estimation for all the above with a) is about 30K + testing cases.
>
>
> 5> backend change (in DwarfDebug|CompileUnit) to support types shared
among
> compile units
> requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related to ref_addr
>
>
> #5 can and should be done before the rest of them.
>
> I prefer to submit patches according to the flow of the compiler, starting
> from the frontend, then IR, then backend.
> The testing cases will be added for front end, llvm-link and backend.
> Any reason why #5 should be done first?
>
>
> All changes should be local to debug info classes except patch #4.
>
>
> What's patch #4?
>
> Patch #4 above: related to issue 3 (changes corresponding to how to solve
> issue #3)
>
> -Manman
>
>
> -eric
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
>

Seemingly Similar Threads

Search for more reasonably related threads

llvm dev - Jun 2013 - [LLVMdev] Proposal: type uniquing of debug info for LTO

[LLVMdev] Proposal: type uniquing of debug info for LTO

[LLVMdev] Proposal: type uniquing of debug info for LTO

[LLVMdev] Proposal: type uniquing of debug info for LTO

Seemingly Similar Threads