Dave Bartolomeo via llvm-dev
2015-Oct-29 17:11 UTC
[llvm-dev] RFC: CodeView debug info emission in Clang/LLVM
RFC: CodeView debug info emission in Clang/LLVM

Overview

On Windows, the de facto debug information format is CodeView, most commonly encountered in the form of a .pdb file. This is the format emitted by the Visual C++, C#, and VB.NET compilers, consumed by the Visual Studio debugger and the Windows debugger (WinDbg), and exposed for read-only access via the DIA SDK. The CodeView format has never been publicly documented, and Microsoft has never provided an API for emitting CodeView info for native code. Therefore, Clang and LLVM have only been able to emit the small subset of CodeView information that the community has been able to reverse engineer.

In order to improve the experience of using Clang and other LLVM-based compilers to target Windows, Microsoft has decided to contribute code to the LLVM project to read and write CodeView debug information, including changes to make Clang and LLVM emit CodeView debug information for C and C++ code. This RFC covers the first phase of this work: Emitting CodeView type information for C and C++. The next phase will be to emit CodeView symbol information for functions and their local variables; I'll send out a separate RFC for that when I get to that phase.

I'll start with some background on the CodeView format, and then move on to the proposed design.

Overview of the CodeView Debug Information Format

"CodeView" is the name we use to refer to the debug record format generated by the Visual C++ compiler and consumed by the Visual Studio debugger, the Windows debugger (WinDbg), and the DIA SDK. CodeView records are contained in either a .pdb file or in an object file. The CodeView records that describe the debug information for a PE image (i.e. a .dll or .exe) are always contained in a corresponding PDB file. The CodeView records that describe the debug information for a COFF object file (.obj) are contained within the .obj itself, although some of the debug information will be stored in a .pdb file if the .obj was compiled with the /Zi or /ZI option.

When code is compiled with cl.exe using the /Z7, /Zi, or /ZI option, cl.exe generates two well-known sections in the resulting .obj file: ".debug$T" and ".debug$S". These are known as the "types" section and the "symbols" section, respectively. The types section contains CodeView records that describe all of the data types referenced by symbols in that .obj. The symbols section contains CodeView records that describe all of the symbols defined within the .obj, including functions, global and static data, and local variables. When link.exe is invoked with the /debug option, all of the debug information from the contributing .obj files is combined into a single .pdb file for the linked image.

The .debug$T Section

The types section of the .obj file contains a short header consisting solely of the version number of the CodeView types format (currently equal to 4), followed by a sequence of CodeView type records. Each type record starts with a 16-bit field holding the length of the record, followed by a 16-bit tag field that identifies the kind of type described by the record. The format of the remainder of the record depends on the tag. Common type record kinds include:

- Pointer
- Array
- Function
- Struct
- Class
- Union
- Enum

Duplicate type records are folded based on a binary comparison of their contents. Thus, there will be only a single instance of the type record for 'const char*' in a given types section, regardless of the number of uses of that type.

When one type record needs to refer to another type record (e.g. a Pointer record referring to the record that describes the referent type of the pointer), it uses a 32-bit "type index", usually abbreviated "TI". A TI with a value less than 0x1000 refers to a well-known type for which no type record actually exists. Examples include primitive types like 'int' or 'wchar_t', and simple pointers to these primitive types. A TI with a value of 0x1000 or greater refers to another type record in the types section, whose zero-based index is determined by subtracting 0x1000 from the value of the TI. It is an invariant of the types section that a given type record may only use a TI to refer to type records defined earlier in the types section. Thus, no cycles are possible. In order to support types with cyclic dependencies, user-defined types (class, struct, union, enum) can have two records for each type: one to describe the forward declaration, and one to describe the definition. Other records refer to the forward declaration of the type, and only the definition record contains the member list of the type. The debugger matches a forward declaration with its definition based on the qualified name of the type.

Type indices are also used within the .debug$S section to refer to types in the .debug$T section.

If a given .obj file was compiled with the /Zi or /ZI option, the type records for that .obj are stored in a separate .pdb file, rather than in the .obj file itself. The records in the PDB have exactly the same format as those in the .obj, so there is essentially no functional difference in the debug info itself.

When the linker generates the .pdb for an image, it creates a single types section in the .pdb consisting of the transitive closure of all of the type records referenced by any symbol in any of the contributing .objs, with any type indices suitably fixed up to refer to the correct record in the merged types section.

The .debug$S Section

The symbols section of the .obj file contains several substreams to describe the symbols defined in that .obj. The most common substreams are:

- Line Numbers: Contains mappings from code address ranges to source file, line, and column.
- Source File Info: Contains the file names and file hashes of source files referenced in the Line Numbers stream.
- Symbols: Contains symbol records that describe functions and variables.

The Symbols substream is a sequence of records that, like the type records, each begin with a 16-bit size and a 16-bit tag. Common symbol record kinds include:

- Global Data
- Function
- Block Scope
- Stack Frame
- Frame Pointer-Relative Variable
- Register-Relative Variable
- Enregistered Variable

Unlike type records, some symbol records can be nested. For example, Function records usually contain a Stack Frame record, local variable records, and Block Scope records. Block Scope records can in turn contain more local variable and Block Scope records.

When a symbol record needs to refer to a data type, it uses a TI that refers to a record in the types section for the .obj.

When the linker generates the .pdb for an image, it creates a separate symbols section in the .pdb for each contributing .obj. The contents of the .obj's symbols section are copied into the corresponding section in the .pdb, fixing up any TIs to refer to the types section of the .pdb, and fixing up any code or data addresses to refer to the correct location in the final linked image.

Proposed Design

How Debug Info is Generated

The CodeView type records for a compilation unit will be generated by the front-end for the source language (Clang, in the case of C and C++). The front-end has access to the full type system and AST of the language, which is necessary to generate accurate debug type info. The type records will be represented as metadata in the LLVM IR, similar to how DWARF debug info is represented. I'll cover the actual representation in a bit more detail below.

The LLVM back-end will be responsible for emitting the CodeView type records from the IR into the output .obj file. Since the type records will already be in the correct format, this is essentially just a copy. No inspection of the type records is necessary within LLVM. The back-end will also be responsible for generating CodeView symbol records, line numbers, and source file info for any functions and data defined in the compilation unit. The back-end is the logical place to do this because only the back-end knows the code addresses, data addresses, and stack frame layouts.

Representation of CodeView in LLVM IR

DICompileUnit
    + existing fields
    + CodeViewTypes : DICodeViewTypes

DICodeViewTypes
    + TypeRecords : MDString[]
    + UDTSymbols : DICodeViewUDT[]

DICodeViewUDT
    + Name : MDString
    + TypeIndex : uint32_t

DIVariable
    + existing fields
    + TypeIndex : uint32_t

DISubprogram
    + existing fields
    + TypeIndex : uint32_t

The existing DICompileUnit node will have a new operand named CodeViewTypes, which points to the new DICodeViewTypes node that describes the CodeView type information for the compilation unit.

The DICodeViewTypes node contains two operands:

- TypeRecords, an array of MDStrings containing the actual CodeView type records for the compilation unit, sorted in ascending order of type index.
- UDTSymbols, an array of DICodeViewUDT nodes describing the user-defined types (class/struct/union/enum) for which CodeView symbol records will need to be emitted by the back-end.

The DICodeViewUDT node contains two operands:

- Name, an MDString with the name of the symbol as it should appear in the CodeView symbol record.
- TypeIndex, a uint32_t holding the CodeView type index of the type record for the user-defined type's definition.

The DICodeViewUDT nodes are necessary because they are generally the only references to the definition of the user-defined type. Other uses of that type refer to the forward declaration record for the type, and without a reference to the definition of the type, the linker will discard the definition record when it merges the type information into the PDB.
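
As a rough illustration of the pass-through design described above (a sketch, not the proposed implementation), the Python below models the new metadata nodes as plain data classes and shows a back-end step that writes the types section by copying the records verbatim, then walks the result back. The class names mirror the proposed nodes; `make_record`, `emit_debug_t`, `iter_records`, the tag values, and the record payloads are hypothetical. Two details the RFC does not spell out are assumed here: the version header is a 32-bit little-endian integer, and the 16-bit length counts the bytes that follow the length field.

```python
import struct
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

CV_TYPES_VERSION = 4  # version number stated in the RFC


@dataclass
class DICodeViewUDT:
    name: str        # symbol name as it should appear in the symbol record
    type_index: int  # TI of the record for the UDT's definition


@dataclass
class DICodeViewTypes:
    type_records: List[bytes] = field(default_factory=list)  # ascending TI order
    udt_symbols: List[DICodeViewUDT] = field(default_factory=list)


def make_record(tag: int, payload: bytes) -> bytes:
    """Frame a payload as: 16-bit length, 16-bit tag, payload.

    Assumption: the length covers the tag and payload, not itself.
    """
    return struct.pack("<HH", 2 + len(payload), tag) + payload


def emit_debug_t(types: DICodeViewTypes) -> bytes:
    """Write the version header, then copy the records without inspecting them.

    Assumption: the header is a 32-bit little-endian integer.
    """
    return struct.pack("<I", CV_TYPES_VERSION) + b"".join(types.type_records)


def iter_records(section: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (tag, payload) for each record that follows the header."""
    offset = 4  # skip the version header
    while offset < len(section):
        length, tag = struct.unpack_from("<HH", section, offset)
        yield tag, section[offset + 4 : offset + 2 + length]
        offset += 2 + length


# Two made-up records; the tags and payload bytes are not real CodeView values.
cu_types = DICodeViewTypes(
    type_records=[make_record(0xAAAA, b"\x01\x02"), make_record(0xBBBB, b"")],
    udt_symbols=[DICodeViewUDT(name="Widget", type_index=0x1001)],
)
section = emit_debug_t(cu_types)
tags = [tag for tag, _ in iter_records(section)]
```

Because the emitter only concatenates bytes, everything that makes a record valid has to be decided by whoever built `type_records`, which is the front-end in this proposal.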
To specify the CodeView type for a variable or function, the DIVariable and DISubprogram nodes will have an additional TypeIndex operand containing the type index of the type record for that variable or function's type. This operand will be set to zero when CodeView debug info is not enabled.

The above representation essentially extends the existing DWARF-focused debug metadata to also include CodeView info. This was the least invasive way I found to add CodeView support, but it may not be the right architectural decision. It would also be possible to have the CodeView metadata entirely separate from the DWARF metadata. This would reduce the size of the IR when only one form of debug information was being emitted, which is presumably the common case. However, I expect it would complicate the scenario where both DWARF and CodeView are being emitted; for example, would having two dbg.declare intrinsics for a single local variable confuse existing consumers of LLVM IR? I'm hoping someone more familiar with the existing debug info architecture can provide some guidance here if there's a better way of doing this.

New Library - LLVMCodeView

The design introduces a new library in LLVM, "LLVMCodeView". This library will contain the code to read and write the CodeView debug info format. The library depends only on the LLVMSupport library, enabling non-LLVM clients to use the library without depending on large portions of LLVM. The LLVMCodeView library is not responsible for translating other forms of information (e.g. LLVM IR, Clang ASTs) to the CodeView format; that work happens in other components.

Changes to LLVMCore

The LLVMCore library will be extended with the definitions of the new debug metadata nodes and new fields on existing nodes, as described previously.

Generating CodeView Type Records in Clang

The clangCodeGen library will be extended with a new class, CodeViewTypeTable. This class is the CodeView equivalent of CGDebugInfo. It translates Clang types into the appropriate CodeView type record on demand, returning the type index of the new record. This is where most of the interesting work happens. Since all of the type records for a given image are merged together by the linker when creating the final .pdb, having the type records emitted by Clang match those emitted by cl.exe as closely as possible minimizes conflicts when object files built by the two compilers are linked together into the same image.
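
To make the type-index rules from the .debug$T background concrete, here is a minimal Python sketch of a table that hands out type indices on demand, folds byte-identical records, and handles a self-referential struct through a forward declaration. It is a toy stand-in, not the interface of the proposed CodeViewTypeTable: the `TypeTable` class, its `add` and `lookup` methods, and the `FWD:`/`PTR:`/`DEF:` byte strings are invented for illustration and are not real CodeView record encodings.

```python
import struct
from typing import Dict, List

FIRST_RECORD_TI = 0x1000  # TIs below this name well-known types with no record


class TypeTable:
    """Toy type table: one TI per distinct record, in order of first use."""

    def __init__(self) -> None:
        self.records: List[bytes] = []
        self._ti_of: Dict[bytes, int] = {}

    def add(self, record: bytes) -> int:
        """Return the TI for `record`, folding byte-identical duplicates."""
        ti = self._ti_of.get(record)
        if ti is None:
            ti = FIRST_RECORD_TI + len(self.records)
            self.records.append(record)
            self._ti_of[record] = ti
        return ti

    def lookup(self, ti: int) -> bytes:
        """Map a TI back to its record via the zero-based index TI - 0x1000."""
        if ti < FIRST_RECORD_TI:
            raise KeyError("well-known type: no record exists for this TI")
        return self.records[ti - FIRST_RECORD_TI]


# struct Node { Node *next; };
# The pointer record refers to the forward declaration, and the definition
# refers to the pointer, so every reference points at an earlier record.
table = TypeTable()
fwd_ti = table.add(b"FWD:Node")
ptr_ti = table.add(b"PTR:" + struct.pack("<I", fwd_ti))
def_ti = table.add(b"DEF:Node:" + struct.pack("<I", ptr_ti))

# A second use of 'Node *' produces the same bytes, so it folds to the same TI.
ptr_again = table.add(b"PTR:" + struct.pack("<I", fwd_ti))
```

In this sketch the forward declaration and the definition are tied together only by the name embedded in each record, which mirrors the RFC's statement that the debugger matches them by qualified name.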
Daniel Dilts via llvm-dev
2015-Oct-29 19:42 UTC
[llvm-dev] [cfe-dev] RFC: CodeView debug info emission in Clang/LLVM
I am really excited to see the work for generating CodeView done. I have two questions:

1. Will the CodeView information be publicly documented?
2. Will LLD and LLDB be updated as necessary to support CodeView?
Adrian Prantl via llvm-dev
2015-Oct-29 21:08 UTC
[llvm-dev] RFC: CodeView debug info emission in Clang/LLVM
> On Oct 29, 2015, at 10:11 AM, Dave Bartolomeo via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Proposed Design
> How Debug Info is Generated
> The CodeView type records for a compilation unit will be generated by the front-end for the source language (Clang, in the case of C and C++). The front-end has access to the full type system and AST of the language, which is necessary to generate accurate debug type info. The type records will be represented as metadata in the LLVM IR, similar to how DWARF debug info is represented. I’ll cover the actual representation in a bit more detail below.
> The LLVM back-end will be responsible for emitting the CodeView type records from the IR into the output .obj file. Since the type records will already be in the correct format, this is essentially just a copy. No inspection of the type records is necessary within LLVM. The back-end will also be responsible for generating CodeView symbol records, line numbers, and source file info for any functions and data defined in the compilation unit. The back-end is the logical place to do this because only the back-end knows the code addresses, data addresses, and stack frame layouts.

Thanks for proposing this.

How different are the type records from the type information we currently have in LLVM's DIType hierarchy? Would it be feasible to move the logic for generating type records from LLVM metadata into the backend? This way a frontend could be agnostic about the debug information format.

-- adrian
Saleem Abdulrasool via llvm-dev
2015-Oct-30 05:02 UTC
[llvm-dev] RFC: CodeView debug info emission in Clang/LLVM
On Thu, Oct 29, 2015 at 2:08 PM, Adrian Prantl via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> > On Oct 29, 2015, at 10:11 AM, Dave Bartolomeo via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >
> > Proposed Design
> > How Debug Info is Generated
> > The CodeView type records for a compilation unit will be generated by the front-end for the source language (Clang, in the case of C and C++). The front-end has access to the full type system and AST of the language, which is necessary to generate accurate debug type info. The type records will be represented as metadata in the LLVM IR, similar to how DWARF debug info is represented. I’ll cover the actual representation in a bit more detail below.
> > The LLVM back-end will be responsible for emitting the CodeView type records from the IR into the output .obj file. Since the type records will already be in the correct format, this is essentially just a copy. No inspection of the type records is necessary within LLVM. The back-end will also be responsible for generating CodeView symbol records, line numbers, and source file info for any functions and data defined in the compilation unit. The back-end is the logical place to do this because only the back-end knows the code addresses, data addresses, and stack frame layouts.
>
> Thanks for proposing this.
>
> How different are the type records from the type information we currently have in LLVM's DIType hierarchy? Would it be feasible to move the logic for generating type records from LLVM metadata into the backend? This way a frontend could be agnostic about the debug information format.
>
> -- adrian

I think that this really is the path we want to follow. If the current metadata we emit is insufficient, we should augment it with additional information sufficient to generate the necessary data in the backend. The same annotations would then be able to generate one OR both debug info formats.

--
Saleem Abdulrasool
compnerd (at) compnerd (dot) org
Dave Bartolomeo via llvm-dev
2015-Oct-30 20:25 UTC
[llvm-dev] [cfe-dev] RFC: CodeView debug info emission in Clang/LLVM
Yes, we will be publically documenting the CodeView format. We’re in the process of making our internal CodeView documentation fit for public consumption. As far as LLD/LLDB goes, we (Microsoft) don’t have any current plans to implement the CodeView support in those projects ourselves. However, we certainly want to make sure that the code and documentation we release to support CodeView within LLVM is sufficient for any other interested member of the community to implement that support. -Dave From: Daniel Dilts [mailto:diltsman at gmail.com] Sent: Thursday, October 29, 2015 12:42 PM To: Dave Bartolomeo <Dave.Bartolomeo at microsoft.com> Cc: llvm-dev at lists.llvm.org; cfe-dev at lists.llvm.org Subject: Re: [cfe-dev] RFC: CodeView debug info emission in Clang/LLVM I am really excited to see the work for generating CodeView done. I have two questions: 1. Will the CodeView information be publicly documented? 2. Will LLD and LLDB be updated as necessary to support CodeView? On Thu, Oct 29, 2015 at 10:11 AM, Dave Bartolomeo via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote: RFC: CodeView debug info emission in Clang/LLVM Overview On Windows, the de facto debug information format is CodeView, most commonly encountered in the form of a .pdb file. This is the format emitted by the Visual C++, C#, and VB.NET<na01.safelinks.protection.outlook.com/?url=http://VB.NET&data=01|01|Dave.Bartolomeo@microsoft.com|973d2475a41141faf32708d2e0b940c4|72f988bf86f141af91ab2d7cd011db47|1&sdata=Vs+623JOO9f6SN+U6h/jV7DW5MlvRV2ymK/I/UyB0Yc=> compilers, consumed by the Visual Studio debugger and the Windows debugger (WinDbg), and exposed for read-only access via the DIA SDK. The CodeView format has never been publically documented, and Microsoft has never provided an API for emitting CodeView info for native code. Therefore, Clang and LLVM have only been able to emit the small subset of CodeView information that the community has been able to reverse engineer. 
In order to improve the experience of using Clang and other LLVM-based compilers to target Windows, Microsoft has decided to contribute code to the LLVM project to read and write CodeView debug information, including changes to make Clang and LLVM emit CodeView debug information for C and C++ code. This RFC covers the first phase of this work: emitting CodeView type information for C and C++. The next phase will be to emit CodeView symbol information for functions and their local variables; I’ll send out a separate RFC for that when I get to that phase.

I’ll start with some background on the CodeView format, and then move on to the proposed design.

Overview of the CodeView Debug Information Format

“CodeView” is the name we use to refer to the debug record format generated by the Visual C++ compiler and consumed by the Visual Studio debugger, the Windows debugger (WinDbg), and the DIA SDK.

CodeView records are contained in either a .pdb file or in an object file. The CodeView records that describe the debug information for a PE image (i.e. a .dll or .exe) are always contained in a corresponding PDB file. The CodeView records that describe the debug information for a COFF object file (.obj) are contained within the .obj itself, although some of the debug information will be stored in a .pdb file if the .obj was compiled with the /Zi or /ZI option.

When code is compiled with cl.exe using the /Z7, /Zi, or /ZI option, cl.exe generates two well-known sections in the resulting .obj file: “.debug$T” and “.debug$S”. These are known as the “types” section and the “symbols” section, respectively. The types section contains CodeView records that describe all of the data types referenced by symbols in that .obj. The symbols section contains CodeView records that describe all of the symbols defined within the .obj, including functions, global and static data, and local variables.
When link.exe is invoked with the /debug option, all of the debug information from the contributing .obj files is combined into a single .pdb file for the linked image.

The .debug$T Section

The types section of the .obj file contains a short header consisting solely of the version number of the CodeView types format (currently equal to 4), followed by a sequence of CodeView type records. Each type record starts with a 16-bit field holding the length of the record, followed by a 16-bit tag field that identifies the kind of type described by the record. The format of the remainder of the record depends on the tag. Common type record kinds include:

- Pointer
- Array
- Function
- Struct
- Class
- Union
- Enum

Duplicate type records are folded based on a binary comparison of their contents. Thus, there will be only a single instance of the type record for ‘const char*’ in a given types section, regardless of the number of uses of that type.

When one type record needs to refer to another type record (e.g. a Pointer record referring to the record that describes the referent type of the pointer), it uses a 32-bit “type index”, usually abbreviated “TI”. A TI with a value less than 0x1000 refers to a well-known type for which no type record actually exists. Examples include primitive types like ‘int’ or ‘wchar_t’, and simple pointers to these primitive types. A TI with a value of 0x1000 or greater refers to another type record in the types section, whose zero-based index is determined by subtracting 0x1000 from the value of the TI.

It is an invariant of the types section that a given type record may only use a TI to refer to type records defined earlier in the types section. Thus, no cycles are possible. In order to support types with cyclic dependencies, user-defined types (class, struct, union, enum) can have two records for each type: one to describe the forward declaration, and one to describe the definition.
Other records refer to the forward declaration of the type, and only the definition record contains the member list of the type. The debugger matches a forward declaration with its definition based on the qualified name of the type.

Type indices are also used within the .debug$S section to refer to types in the .debug$T section.

If a given .obj file was compiled with the /Zi or /ZI option, the type records for that .obj are stored in a separate .pdb file, rather than in the .obj file itself. The records in the PDB have exactly the same format as those in the .obj, so there is essentially no functional difference in the debug info itself.

When the linker generates the .pdb for an image, it creates a single types section in the .pdb consisting of the transitive closure of all of the type records referenced by any symbol in any of the contributing .objs, with any type indices suitably fixed up to refer to the correct record in the merged types section.

The .debug$S Section

The symbols section of the .obj file contains several substreams to describe the symbols defined in that .obj. The most common substreams are:

- Line Numbers: Contains mappings from code address ranges to source file, line, and column.
- Source File Info: Contains the file names and file hashes of source files referenced in the Line Numbers stream.
- Symbols: Contains symbol records that describe functions and variables.

The Symbols substream is a sequence of records that, like the type records, each begin with a 16-bit size and a 16-bit tag. Common symbol record kinds include:

- Global Data
- Function
- Block Scope
- Stack Frame
- Frame Pointer-Relative Variable
- Register-Relative Variable
- Enregistered Variable

Unlike type records, some symbol records can be nested. For example, Function records usually contain a Stack Frame record, local variable records, and Block Scope records. Block Scope records can in turn contain more local variable and Block Scope records.
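As a rough illustration of the record layout and type-index rule described above, here is a minimal Python sketch. It is not taken from any Microsoft or LLVM code. It assumes that fields are little-endian and that the 16-bit length counts the bytes following the length field (the tag plus the payload); the function names and the tag values used with it are hypothetical placeholders, not real CodeView constants.

```python
import struct

FIRST_RECORD_TI = 0x1000  # TIs below this name well-known types with no record


def iter_records(data: bytes):
    """Yield (tag, payload) for each record in a run of CodeView-style records.

    Assumptions (not confirmed by the RFC text): fields are little-endian,
    and the 16-bit length counts the bytes after the length field itself.
    """
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from("<H", data, offset)
        if length < 2 or offset + 2 + length > len(data):
            raise ValueError(f"malformed record at offset {offset}")
        (tag,) = struct.unpack_from("<H", data, offset + 2)
        payload = data[offset + 4 : offset + 2 + length]
        yield tag, payload
        offset += 2 + length


def make_record(tag: int, payload: bytes) -> bytes:
    """Build one record using the same layout assumption as iter_records."""
    return struct.pack("<HH", 2 + len(payload), tag) + payload


def resolve_ti(ti: int):
    """Classify a type index: well-known type, or zero-based record index."""
    if ti < FIRST_RECORD_TI:
        return ("well-known", ti)
    return ("record", ti - FIRST_RECORD_TI)
```

With these helpers, a buffer built from `make_record(0x0001, b"ab")` followed by `make_record(0x0002, b"")` can be walked with `iter_records`, and `resolve_ti(0x1005)` maps to the sixth record in the types section.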
When a symbol record needs to refer to a data type, it uses a TI that refers to a record in the types section for the .obj.

When the linker generates the .pdb for an image, it creates a separate symbols section in the .pdb for each contributing .obj. The contents of the .obj’s symbols section are copied into the corresponding section in the .pdb, fixing up any TIs to refer to the types section of the .pdb, and fixing up any code or data addresses to refer to the correct location in the final linked image.

Proposed Design

How Debug Info is Generated

The CodeView type records for a compilation unit will be generated by the front-end for the source language (Clang, in the case of C and C++). The front-end has access to the full type system and AST of the language, which is necessary to generate accurate debug type info. The type records will be represented as metadata in the LLVM IR, similar to how DWARF debug info is represented. I’ll cover the actual representation in a bit more detail below.

The LLVM back-end will be responsible for emitting the CodeView type records from the IR into the output .obj file. Since the type records will already be in the correct format, this is essentially just a copy. No inspection of the type records is necessary within LLVM. The back-end will also be responsible for generating CodeView symbol records, line numbers, and source file info for any functions and data defined in the compilation unit. The back-end is the logical place to do this because only the back-end knows the code addresses, data addresses, and stack frame layouts.
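To make the "essentially just a copy" step concrete, the following Python sketch shows what emitting a types section from pre-built records could look like. It is only an illustration under stated assumptions: the version header is assumed to be a single 32-bit little-endian integer, and `emit_types_section` is a hypothetical name, not an LLVM function.

```python
import struct

CV_TYPES_VERSION = 4  # version number the RFC gives for the types format


def emit_types_section(type_records):
    """Concatenate pre-built type records behind the version header.

    Assumption: the header is one 32-bit little-endian integer.
    `type_records` is a list of bytes objects in ascending type-index order.
    """
    out = bytearray(struct.pack("<I", CV_TYPES_VERSION))
    for record in type_records:
        out += record  # opaque copy: the records are never inspected here
    return bytes(out)
```

The point of the sketch is that the emitter treats each record as an opaque byte string; everything that requires knowledge of the type system has already happened in the front-end.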
Representation of CodeView in LLVM IR

DICompileUnit
  + existing fields
  + CodeViewTypes : DICodeViewTypes

DICodeViewTypes
  + TypeRecords : MDString[]
  + UDTSymbols : DICodeViewUDT[]

DICodeViewUDT
  + Name : MDString
  + TypeIndex : uint32_t

DIVariable
  + existing fields
  + TypeIndex : uint32_t

DISubprogram
  + existing fields
  + TypeIndex : uint32_t

The existing DICompileUnit node will have a new operand named CodeViewTypes, which points to the new DICodeViewTypes node that describes the CodeView type information for the compilation unit.

The DICodeViewTypes node contains two operands:

- TypeRecords, an array of MDStrings containing the actual CodeView type records for the compilation unit, sorted in ascending order of type index.
- UDTSymbols, an array of DICodeViewUDT nodes describing the user-defined types (class/struct/union/enum) for which CodeView symbol records will need to be emitted by the back-end.

The DICodeViewUDT node contains two operands:

- Name, an MDString with the name of the symbol as it should appear in the CodeView symbol record.
- TypeIndex, a uint32_t holding the CodeView type index of the type record for the user-defined type’s definition.

The DICodeViewUDT nodes are necessary because they are generally the only references to the definition of the user-defined type. Other uses of that type refer to the forward declaration record for the type, and without a reference to the definition of the type, the linker will discard the definition record when it merges the type information into the PDB.

To specify the CodeView type for a variable or function, the DIVariable and DISubprogram nodes will have an additional TypeIndex operand containing the type index of the type record for that variable or function’s type. This operand will be set to zero when CodeView debug info is not enabled.

The above representation essentially extends the existing DWARF-focused debug metadata to also include CodeView info.
This was the least invasive way I found to add CodeView support, but it may not be the right architectural decision. It would also be possible to have the CodeView metadata entirely separate from the DWARF metadata. This would reduce the size of the IR when only one form of debug information was being emitted, which is presumably the common case. However, I expect it would complicate the scenario where both DWARF and CodeView are being emitted; for example, would having two dbg.declare intrinsics for a single local variable confuse existing consumers of LLVM IR? I’m hoping someone more familiar with the existing debug info architecture can provide some guidance here if there’s a better way of doing this.

New Library - LLVMCodeView

The design introduces a new library in LLVM, “LLVMCodeView”. This library will contain the code to read and write the CodeView debug info format. The library depends only on the LLVMSupport library, enabling non-LLVM clients to use the library without depending on large portions of LLVM. The LLVMCodeView library is not responsible for translating other forms of information (e.g. LLVM IR, Clang ASTs) to the CodeView format; that work happens in other components.

Changes to LLVMCore

The LLVMCore library will be extended with the definitions of the new debug metadata nodes and new fields on existing nodes, as described previously.

Generating CodeView Type Records in Clang

The clangCodeGen library will be extended with a new class, CodeViewTypeTable. This class is the CodeView equivalent of CGDebugInfo. It translates Clang types into the appropriate CodeView type record on demand, returning the type index of the new record. This is where most of the interesting work happens.
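The behaviour described for CodeViewTypeTable (create a record on demand, return its type index) combines naturally with two rules from the format overview: duplicate records are folded by binary comparison, and a record may only refer to records defined earlier. The Python sketch below is a toy model of that combination, not the proposed C++ class; the `TypeTable` name, the `add` method, and the `referenced_tis` parameter are hypothetical, and the record contents are placeholder bytes rather than real CodeView records.

```python
FIRST_RECORD_TI = 0x1000  # first type index that names an actual record


class TypeTable:
    """Toy model of an on-demand type table (all names are hypothetical)."""

    def __init__(self):
        self.records = []   # record bytes; position i has TI 0x1000 + i
        self._index = {}    # record bytes -> TI, used to fold duplicates

    def add(self, record: bytes, referenced_tis=()):
        """Return the TI for `record`, appending it only if it is new."""
        next_ti = FIRST_RECORD_TI + len(self.records)
        # Invariant from the RFC: a record may only refer to earlier records
        # (or to well-known types below 0x1000), so no cycles can form.
        for ti in referenced_tis:
            if ti >= next_ti:
                raise ValueError(f"forward reference to TI {ti:#x}")
        existing = self._index.get(record)
        if existing is not None:
            return existing  # folded: byte-identical record already present
        self.records.append(record)
        self._index[record] = next_ti
        return next_ti
```

Because `records` stays in ascending type-index order, its contents could be handed directly to whatever stores the TypeRecords array for the compilation unit.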
Since all of the type records for a given image are merged together by the linker when creating the final .pdb, having the type records emitted by Clang match those emitted by cl.exe as closely as possible minimizes conflicts when object files built by the two compilers are linked together into the same image.
Reid Kleckner via llvm-dev
2015-Nov-03 15:51 UTC
[llvm-dev] [cfe-dev] RFC: CodeView debug info emission in Clang/LLVM
On Thu, Oct 29, 2015 at 12:42 PM, Daniel Dilts via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> 2. Will LLD and LLDB be updated as necessary to support CodeView?

Rui is looking at making LLD link CodeView from object files into PDBs. Zachary Turner intends to add PDB reading support to LLDB.

We already have a PDB implementation of DIContext in lib/DebugInfo that uses PDBs. The only client is currently llvm-symbolizer, but the idea was that LLDB could use it, and eventually we should shift it off DIA and over to something cross-platform.
Reid Kleckner via llvm-dev
2016-Mar-03 01:19 UTC
[llvm-dev] [cfe-dev] RFC: CodeView debug info emission in Clang/LLVM
Circling back around 4 months later...

I now believe that we should just let the frontend generate CV type info. It's really not worth the hassle to try to have a common representation. Enough C++ ABI-specific information leaks into the format that it's really better to avoid trying to create a union of DWARF and CV type info in LLVM DI metadata. We were able to reuse all the other non-type DI metadata, such as location info and scope info, to emit inline line tables and variable locations, so I think we did OK on reusing the existing infrastructure. Compromising by not reusing the type representation seems OK.

I haven't come up with any ideas better than the design that Dave Bartolomeo outlined below, so I think we should go ahead with that. One thing I considered was extending DITypeRef to be a union between MDString*, DIType*, and a type index, but I think that's too invasive. I also don't want to make a whole DIType heap allocation just to wrap a 32-bit type index, so I'm in favor of putting the indices into DISubprogram and DIVariable.

Any thoughts on this plan?

On Thu, Oct 29, 2015 at 10:11 AM, Dave Bartolomeo via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> *Proposed Design*
>
> *How Debug Info is Generated*
>
> The CodeView type records for a compilation unit will be generated by the front-end for the source language (Clang, in the case of C and C++). The front-end has access to the full type system and AST of the language, which is necessary to generate accurate debug type info. The type records will be represented as metadata in the LLVM IR, similar to how DWARF debug info is represented. I’ll cover the actual representation in a bit more detail below.
>
> The LLVM back-end will be responsible for emitting the CodeView type records from the IR into the output .obj file. Since the type records will already be in the correct format, this is essentially just a copy. No inspection of the type records is necessary within LLVM.
> The back-end will also be responsible for generating CodeView symbol records, line numbers, and source file info for any functions and data defined in the compilation unit. The back-end is the logical place to do this because only the back-end knows the code addresses, data addresses, and stack frame layouts.
>
> *Representation of CodeView in LLVM IR*
>
> DICompileUnit
>   + *existing fields*
>   + CodeViewTypes : DICodeViewTypes
>
> DICodeViewTypes
>   + TypeRecords : MDString[]
>   + UDTSymbols : DICodeViewUDT[]
>
> DICodeViewUDT
>   + Name : MDString
>   + TypeIndex : uint32_t
>
> DIVariable
>   + *existing fields*
>   + TypeIndex : uint32_t
>
> DISubprogram
>   + *existing fields*
>   + TypeIndex : uint32_t
>
> The existing DICompileUnit node will have a new operand named CodeViewTypes, which points to the new DICodeViewTypes node that describes the CodeView type information for the compilation unit.
>
> The DICodeViewTypes node contains two operands:
>
> - TypeRecords, an array of MDStrings containing the actual CodeView type records for the compilation unit, sorted in ascending order of type index.
> - UDTSymbols, an array of DICodeViewUDT nodes describing the user-defined types (class/struct/union/enum) for which CodeView symbol records will need to be emitted by the back-end.
>
> The DICodeViewUDT node contains two operands:
>
> - Name, an MDString with the name of the symbol as it should appear in the CodeView symbol record.
> - TypeIndex, a uint32_t holding the CodeView type index of the type record for the user-defined type’s definition.
>
> The DICodeViewUDT nodes are necessary because they are generally the only references to the definition of the user-defined type. Other uses of that type refer to the forward declaration record for the type, and without a reference to the definition of the type, the linker will discard the definition record when it merges the type information into the PDB.
>
> To specify the CodeView type for a variable or function, the DIVariable and DISubprogram nodes will have an additional TypeIndex operand containing the type index of the type record for that variable or function’s type. This operand will be set to zero when CodeView debug info is not enabled.
>
> The above representation essentially extends the existing DWARF-focused debug metadata to also include CodeView info. This was the least invasive way I found to add CodeView support, but it may not be the right architectural decision. It would also be possible to have the CodeView metadata entirely separate from the DWARF metadata. This would reduce the size of the IR when only one form of debug information was being emitted, which is presumably the common case. However, I expect it would complicate the scenario where both DWARF and CodeView are being emitted; for example, would having two dbg.declare intrinsics for a single local variable confuse existing consumers of LLVM IR? I’m hoping someone more familiar with the existing debug info architecture can provide some guidance here if there’s a better way of doing this.
David Blaikie via llvm-dev
2016-Mar-03 18:26 UTC
[llvm-dev] [cfe-dev] RFC: CodeView debug info emission in Clang/LLVM
I think it'd be reasonable to at least figure out a good way to do type references consistently across the two schemes, but I'm OK with the idea of having a blob of opaque type information for different debug info formats, created by frontends.

I don't mind if the library for building that blob lives in LLVM or Clang for now. The DWARF one at least would probably live in LLVM, because type info and other DWARF are described by similar/the same constructs (DIEs, abbrevs, etc). It seems like that's not the case for PDB, so there might not be any code to share between LLVM's CodeView needs and the type info construction. Then it's just a matter of whether pushing that library down into LLVM for other frontends to use would be good, which it probably will be at some point, so if it goes into Clang I'd at least try to keep it pretty well separated.

Potentially that consistency could be created by going the other way: replace DITypeRef with an int, then have the retained types list be the int->type mapping, skipping the mangled names (and skip the retained types list for CV/PDB).

- Dave

On Wed, Mar 2, 2016 at 5:19 PM, Reid Kleckner via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Circling back around 4 months later...
>
> I now believe that we should just let the frontend generate CV type info. It's really not worth the hassle to try to have a common representation. Enough C++ ABI-specific information leaks into the format that it's really better to avoid trying to create a union of DWARF and CV type info in LLVM DI metadata. We were able to reuse all the other non-type DI metadata, such as location info and scope info, to emit inline line tables and variable locations, so I think we did OK on reusing the existing infrastructure. Compromising by not reusing the type representation seems OK.
>
> I haven't come up with any ideas better than the design that Dave Bartolomeo outlined below, so I think we should go ahead with that.