Necip Yildiran via llvm-dev
2021-Jun-10 19:48 UTC
[llvm-dev] [RFC] Computing, storing, and restoring conservative call graphs with LLVM
Hi all, Please find the RFC for computing, storing, and restoring conservative call graphs with LLVM below. This is an early version of the proposed feature, and we are still exploring the ways to implement it. Comments and suggestions are welcome. Summary ====== Inferring indirect call targets from a binary is challenging without source-level information. Hence, reconstruction of a fine-grained call graph from the binary is unfeasible for indirect/virtual calls. To address this, we propose to implement a feature in LLVM to collect the necessary information to construct the call graph later while the source information is present (i.e., during compilation), and store it in a non-code section of the binary. Additionally, we propose to implement a stand-alone tool, llvm-discg, or a feature in llvm-objdump to extract and serialize the call graph. Together, these will allow recovering a conservative (i.e., possibly bigger than actual to avoid missing information) and high-precision call graph only from the binary. In the following, we discuss the motivation for the proposed changes, and describe how we plan to construct, store, and reconstruct the call graph. Background and motivation ======================== Source code provides rich semantic information such as types, which is mostly lost when it is compiled to a binary. A coarse-grained call graph can be recovered when indirect calls are allowed to target any function in the program, resulting in poor precision. On the other hand, source-level information can easily increase the precision of the call graphs, e.g., by restricting the target of an indirect call to functions with matching type. It is infeasible to fully reconstruct the call graph for the binary even with the source in hand. The call graph for the source is possibly different than for the binary due to removed or inserted function calls, e.g., tail call optimizations or new function calls to libc. Therefore, construction of a call graph during compilation is favorable. Since the call graph belongs to a specific compilation of the source, a non-code section in the binary is the right scope to store such information. Use case ======= Many program analysis tools (bug detectors, profilers, etc.) collect call stack traces during execution, to later present them in diagnostic reports. In some cases, like with AddressSanitizer [1], stack trace collection is not a performance critical operation since the rest of the tool’s overhead is much higher. In other cases, such as statistical heap profilers, it is also not a performance bottleneck since the profilers sample a small fraction of events. For memory safety error detectors that use the upcoming Arm Memory Tagging Extension (Arm MTE, [2]), however, stack trace collection will become a significant bottleneck, with respect to CPU and RAM usage, because stack traces will need to be collected on every malloc() and free() call. Reducing this overhead will enable collecting stack traces and improve quality of diagnostics for MTE-enabled dynamic analysis in production. Our goal is to achieve efficient stack trace collection and reconstruction via offline reverse call graph traversal. It involves emitting stack trace hash (instead of the full trace) during runtime; and reconstructing the full stack trace offline from a stack trace hash and a call graph when needed. The proposed LLVM features will enable this by reconstructing an accurate call graph from the binary. Construction of the conservative call graph ========================================== The call graph for direct calls is trivial to obtain. For indirect or virtual calls, the conservative assumption will be made: all functions with matching signature (with relaxations, e.g., see -fsanitize-cfi-icall-generalize-pointers [3]) will be assumed to be a possible target. To further narrow down, a function will not be considered as a target unless it has external linkage or its address is taken. Further narrowing for C++ virtual calls can be done later. The possible targets of indirect calls will be split into disjoint classes of unique function signatures. In the call graph, an indirect call will have an edge to the receiver function class with matching signature to reflect the possible targets for the call. The call graph will be saved in a non-code section of the binary, i.e., .callgraph. The functions and call sites will be identified with unique function entry and call site labels in the machine code, which can be translated into instruction addresses. The hash of function signatures will be used to identify and refer to the disjoint classes of indirect targets. To achieve this in LLVM, a new flag will be introduced, which constructs and emits the call graph in the compiler backend, possibly implemented in llvm/lib/codeGen/AsmPrinter/AsmPrinter.cpp. Call graph layout in the binary ============================== For direct calls, no additional information will be stored in the binary. For indirect calls, the call graph will be represented in two pieces: * a non-code section, .callgraph, mapping the function signature hashes to the list of target functions entries and indirect/virtual call sites. The first word is a version number to account for potential changes in the format. * labels in .text, indicating the target function entries and indirect/virtual call sites. The .callgraph section will have pointers to these labels. This will not affect the executable code. The layout is as follows: <.callgraph section> FormatVersionNumber, TypeHash1, FuncEntryLabelCount1, .LFunc1Entry, .LFunc2Entry, ..., .LFuncNEntry, CallSiteLabelCount1, .LIndirectCallSite1, ..., .LIndirectCallSiteM, FormatVersionNumber, TypeHash2, … FuncEntryLabelCount2, ..., CallSiteLabelCount2, ..., ... <.text section> ... .LFunc1Entry: ; unique label for target function entry for func1() push %rbp ... .LIndirectCallSite1 ; unique label for indirect function call 1 callq Y(%rbp) ... Because the linker may concatenate .callgraph sections with different format version numbers, we repeat the format version number for each TypeHash list. This avoids any special logic for the .callgraph section in the linker. Restoring the call graph ======================= To extract the call graph from the binary and serialize, a new LLVM tool, llvm-discg, will be implemented using LLVM disassembler library, or a new feature will be introduced into llvm-objdump. The direct call graph will be reconstructed from the disassembly without needing the stored call graph data. The indirect call graph will be reconstructed from the .callgraph section and the disassembly. For the indirect call graph, the labels in the .callgraph section will be resolved to instruction addresses. Then, disjoint classes of function entries and call sites will be recovered based on their type hashes. For a call site with type hash H, edges will be made to all function entries with the same type hash H. Repeating the procedure for all call sites, the conservative indirect call graph will be reconstructed. Future Direction =============== The construction mode of the call graph can be parametrized to tune the trade-off between the completeness and soundness of the call graph based on additional heuristics. References ========= [1] https://clang.llvm.org/docs/AddressSanitizer.html [2] https://security.googleblog.com/2019/08/adopting-arm-memory-tagging-extension.html [3] https://clang.llvm.org/docs/ControlFlowIntegrity.html#fsanitize-cfi-icall-generalize-pointers Best, Necip Fazil Yildiran
Necip Yildiran via llvm-dev
2021-Jul-12 03:24 UTC
[llvm-dev] [RFC] Computing, storing, and restoring conservative call graphs with LLVM
Hi all, Please see the updates to the RFC below. The terms “type hash”, “type identifier”, and “type id” are used interchangeably. The original RFC describes a use case, which is efficient stack trace collection and reconstruction via offline reverse call graph traversal. Besides call graph storing and reconstruction as described in this RFC, additional LLVM features will be introduced for this use case including 1) ASan features to collect and report compressed stack traces, 2) a stand-alone tool to reconstruct stack traces based on the compressed trace and the call graph. A separate RFC will be sent shortly for these. Updates ====== - Added implementation details on computing and storing call graph (see “Implementation” section). - Added details on restoring the call graph (see “Updates on restoring the call graph” section). Proposed to introduce an option to llvm-objdump (–call-graph-info) instead of having a new stand-alone tool. Added the output layout. Implementation ============= The construction scheme will consist of the following parts: 1. Creating type ids for indirect calls and targets at clang front-end Generalized type identifiers will be computed for each indirect call and target using clang::CodeGenModule::CreateMetadataIdentifierGeneralized(). These type ids will be passed to the middle-end as metadata. The type information is deliberately computed at the front-end from C/C++ types instead of IR types due to two main reasons. First, it provides extra precision since some type information is lost at IR. Second, it ensures consistent type identifiers between object files compiled at different times since C/C++ standards require language-level types to match. 2. Propagating indirect call type ids from the middle-end (LLVM IR) to the back-end (machine instructions at AsmPrinter) with CallSiteInfo To form the call graph section, the type ids for indirect calls and targets are needed at the machine-instruction level. However, the metadata is not propagated from LLVM IR to machine instructions. For indirect targets, machine functions can be mapped back to IR functions. This enables retrieving the type id for a machine function from the metadata of the IR function. However, there is no such machine-to-IR mapping for arbitrary instructions. To address this, CallSiteInfo (originally used for mapping call arguments to forwarding registers) will be extended to include call type ids, so that, they can be retrieved at the back-end for indirect calls. 3. Creating comdats for functions whose symbols will be referenced from the call graph section Comdats will be used to enable dead-stripping of the call graph entries. An LLVM pass will be implemented and enabled to create a comdat for each function with at least one call graph entry. The entries of each function will be emitted to the section created using its comdat. As a result, the call graph entries will be discarded if the linked functions get removed. 4. Emitting call graph section The type ids will be retrieved and the call graph sections will be created at the assembly printer as described above. A special type id (0) will be assigned to indirect targets lacking the metadata (possibly due to metadata loss), so that a complete list of indirect targets can be recovered from the call graph section. This makes indirect targets distinguishable from others, which provides additional precision. Labels will be emitted for indirect targets and calls, and their values will be emitted to the call graph section as pointers to these labels. Gathering all these pieces, the call graph sections will be constructed at the assembly printer following the described layout. Updates on restoring the call graph ================================== Instead of creating a new stand-alone LLVM tool, a new feature will be introduced to llvm-objdump (--call-graph-info) to extract and serialize the call graph from the binary. This will enable code reuse for disassembly. The output will have the following layout: INDIRECT TARGETS TYPES TypeId1 FuncEntryAddr1 FuncEntryAddr2 ... FuncEntryAddrN TypeId2 ... INDIRECT CALLS TYPES TypeId1 ICallSite1 ICallSite2 ... ICallSiteN TypeId2 ... INDIRECT CALL SITES CallerFuncEntryAddr1 ICallSite1 ICallSite2 ... ICallSiteN CallerFuncEntryAddr2 ... DIRECT CALL SITES CallerFuncEntryAddr1 DCallSite1 DCallTarget1 ... DCallSiteN DCallTargetN CallerFuncEntryAddr2 ... FUNCTION SYMBOLS FuncEntryAddr1 FuncName1 FuncEntryAddr2 FuncName2 ... The information for the indirect target and call type identifiers will be retrieved from the call graph section. The call sites will be extracted from the disassembly. The function symbols will be extracted from the symbol table. If an indirect call has no type id, it will still appear in “INDIRECT CALL SITES” list but not in “INDIRECT CALL TYPES” list. In such cases, receiver edges can be drawn to all indirect targets from such call sites for completeness of the call graph, Indirect targets without a type id are assigned a special id (0) in the call graph section construction. For completeness, these functions can be taken as receiver to any indirect call. Best, Necip Fazil Yildiran On Thu, Jun 10, 2021 at 3:48 PM Necip Yildiran <necip at google.com> wrote:> Hi all, > > Please find the RFC for computing, storing, and restoring conservative > call graphs with LLVM below. > > This is an early version of the proposed feature, and we are still > exploring the ways to implement it. Comments and suggestions are > welcome. > > > Summary > ======> > Inferring indirect call targets from a binary is challenging without > source-level information. Hence, reconstruction of a fine-grained > call graph from the binary is unfeasible for indirect/virtual calls. > To address this, we propose to implement a feature in LLVM to collect > the necessary information to construct the call graph later while the > source information is present (i.e., during compilation), and store it > in a non-code section of the binary. Additionally, we propose to > implement a stand-alone tool, llvm-discg, or a feature in llvm-objdump > to extract and serialize the call graph. Together, these will allow > recovering a conservative (i.e., possibly bigger than actual to avoid > missing information) and high-precision call graph only from the > binary. > > In the following, we discuss the motivation for the proposed changes, > and describe how we plan to construct, store, and reconstruct the call > graph. > > > Background and motivation > ========================> > Source code provides rich semantic information such as types, which is > mostly lost when it is compiled to a binary. A coarse-grained call > graph can be recovered when indirect calls are allowed to target any > function in the program, resulting in poor precision. On the other > hand, source-level information can easily increase the precision of > the call graphs, e.g., by restricting the target of an indirect call > to functions with matching type. > > It is infeasible to fully reconstruct the call graph for the binary > even with the source in hand. The call graph for the source is > possibly different than for the binary due to removed or inserted > function calls, e.g., tail call optimizations or new function calls to > libc. Therefore, construction of a call graph during compilation is > favorable. Since the call graph belongs to a specific compilation of > the source, a non-code section in the binary is the right scope to > store such information. > > > Use case > =======> > Many program analysis tools (bug detectors, profilers, etc.) collect > call stack traces during execution, to later present them in > diagnostic reports. In some cases, like with AddressSanitizer [1], > stack trace collection is not a performance critical operation since > the rest of the tool’s overhead is much higher. In other cases, such > as statistical heap profilers, it is also not a performance bottleneck > since the profilers sample a small fraction of events. > > For memory safety error detectors that use the upcoming Arm Memory > Tagging Extension (Arm MTE, [2]), however, stack trace collection will > become a significant bottleneck, with respect to CPU and RAM usage, > because stack traces will need to be collected on every malloc() and > free() call. Reducing this overhead will enable collecting stack > traces and improve quality of diagnostics for MTE-enabled dynamic > analysis in production. > > Our goal is to achieve efficient stack trace collection and > reconstruction via offline reverse call graph traversal. It involves > emitting stack trace hash (instead of the full trace) during runtime; > and reconstructing the full stack trace offline from a stack trace > hash and a call graph when needed. The proposed LLVM features will > enable this by reconstructing an accurate call graph from the binary. > > > Construction of the conservative call graph > ==========================================> > The call graph for direct calls is trivial to obtain. For indirect or > virtual calls, the conservative assumption will be made: all functions > with matching signature (with relaxations, e.g., see > -fsanitize-cfi-icall-generalize-pointers [3]) will be assumed to be a > possible target. To further narrow down, a function will not be > considered as a target unless it has external linkage or its address > is taken. Further narrowing for C++ virtual calls can be done later. > > The possible targets of indirect calls will be split into disjoint > classes of unique function signatures. In the call graph, an indirect > call will have an edge to the receiver function class with matching > signature to reflect the possible targets for the call. > > The call graph will be saved in a non-code section of the binary, > i.e., .callgraph. The functions and call sites will be identified > with unique function entry and call site labels in the machine code, > which can be translated into instruction addresses. The hash of > function signatures will be used to identify and refer to the disjoint > classes of indirect targets. > > To achieve this in LLVM, a new flag will be introduced, which > constructs and emits the call graph in the compiler backend, possibly > implemented in llvm/lib/codeGen/AsmPrinter/AsmPrinter.cpp. > > > Call graph layout in the binary > ==============================> > For direct calls, no additional information will be stored in the > binary. For indirect calls, the call graph will be represented in two > pieces: > > * a non-code section, .callgraph, mapping the function signature > hashes to the list of target functions entries and indirect/virtual > call sites. The first word is a version number to account for > potential changes in the format. > > * labels in .text, indicating the target function entries and > indirect/virtual call sites. The .callgraph section will have pointers > to these labels. This will not affect the executable code. > > > The layout is as follows: > > <.callgraph section> > FormatVersionNumber, TypeHash1, > FuncEntryLabelCount1, .LFunc1Entry, .LFunc2Entry, ..., .LFuncNEntry, > CallSiteLabelCount1, .LIndirectCallSite1, ..., .LIndirectCallSiteM, > FormatVersionNumber, TypeHash2, … > FuncEntryLabelCount2, ..., > CallSiteLabelCount2, ..., > ... > > <.text section> > ... > .LFunc1Entry: ; unique label for target function entry for > func1() > push %rbp > ... > > .LIndirectCallSite1 ; unique label for indirect function call 1 > callq Y(%rbp) > ... > > > Because the linker may concatenate .callgraph sections with different > format version numbers, we repeat the format version number for each > TypeHash list. This avoids any special logic for the .callgraph > section in the linker. > > > Restoring the call graph > =======================> > To extract the call graph from the binary and serialize, a new LLVM > tool, llvm-discg, will be implemented using LLVM disassembler library, > or a new feature will be introduced into llvm-objdump. The direct > call graph will be reconstructed from the disassembly without needing > the stored call graph data. The indirect call graph will be > reconstructed from the .callgraph section and the disassembly. > > For the indirect call graph, the labels in the .callgraph section will > be resolved to instruction addresses. Then, disjoint classes of > function entries and call sites will be recovered based on their type > hashes. For a call site with type hash H, edges will be made to all > function entries with the same type hash H. Repeating the procedure > for all call sites, the conservative indirect call graph will be > reconstructed. > > > Future Direction > ===============> > The construction mode of the call graph can be parametrized to tune > the trade-off between the completeness and soundness of the call graph > based on additional heuristics. > > > References > =========> > [1] https://clang.llvm.org/docs/AddressSanitizer.html > [2] > https://security.googleblog.com/2019/08/adopting-arm-memory-tagging-extension.html > [3] > https://clang.llvm.org/docs/ControlFlowIntegrity.html#fsanitize-cfi-icall-generalize-pointers > > > Best, > Necip Fazil Yildiran >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210711/c01ad3df/attachment.html>