Alex L
2014-Jun-17 00:08 UTC
[LLVMdev] Proposal: add support for profile instrumentation based code coverage
I've been investigating better support for code coverage in Clang/LLVM. I've spent some time experimenting with several prototypes and have arrived at this proposal. I'd like to get some feedback on the overall approach before I start submitting patches. Background: ----------- Right now LLVM supports code coverage testing using llvm-cov, which is currently intended to work like gcov and reuses the same data files. However, the recent addition of instrumentation based profiling to clang (the -fprofile-instr-generate command line option) enabled us to do a much more accurate code coverage for the languages supported by clang. The counters are emitted in the frontend, and therefore we can produce an accurate mapping from the source ranges extracted out of the AST to the corresponding profiling counters generated by the frontend. Proposal: --------- I would like to add a language independent coverage mapping data format to LLVM. This mapping data would be generated by a frontend such as clang, and embedded into LLVM IR in the form of a binary blob which will get a separate object file section. Then, a tool like llvm-cov would extract this data from the object file, load the gathered profiling counter values obtained after a doing an instrumented run, and display accurate code coverage information for the desired portion of the program's source code. Here are my main design goals: * The format should strive to be as language independent as possible so that it could be reused by frontends other than clang. * The format should enable the frontend to provide accurate source ranges for the source constructs with different execution counts. * The format should enable the frontend to minimize the size of the coverage mapping data (crucial for clang), but it shouldn't restrict it to just size minimization - very accurate (thus somewhat redundant) per statement or per expression coverage should be possible for the frontends that desire it. * The format should enable accurate code coverage for C's macros and includes embedded inside the source code of a function. * The format should have enough data so that the tools that will work with it will be able to show accurate locations for the different execution counts, highlight the various source ranges based on their execution counts, and show expansions of macros and embedded includes with accurate coverage for the source code inside the macro/embedded include. In addition to the mapping format, I would like to add support for the generation of the coverage mapping data to clang and support for the analysis of this data to llvm-cov. Summary of the Coverage Mapping Format: --------------------------------------- Each function instrumented by a frontend like clang will emit mapping information that the coverage tool can use to map between the source code's regions and counter values, which are obtained during an instrumented run. The mapping information for an instrumented function contains the following arrays: * Counter Expression Array * Mapping Region Array * Optional Removed Region Array * File Name Array + Optional Virtual File Mapping Array. * Optional Expansion Region Array The mapping information is composed of the following data structures: * Counter: A counter is an abstract value that on its own, or as a part of a counter expression is able to produce an integer value which represents the number of times that a particular region of code has been executed. There are three types of counters: reference to the actual counter value from the instrumented run, reference to a counter expression, and zero. A reference to the counter value stores an integer index into the array with the gathered counter values. A reference to the expression stores an integer index into the expression array. Because the coverage tool should be able to resolve counter value references to the actual counter values from the instrumented run, we also need to know the function’s name, as it acts as a key when looking up the values in a file. In clang, it is possible to reuse the function name that is already stored by the profile instrumentation (i.e. __llvm_profile_name_???). * Counter Expression: An expression is a special kind of counter that represents an arithmetic operation on two counters (expressions as well, as counters can be references to expressions). There are two types of expressions: integer addition and integer subtraction. Subtraction expressions represent constructs like the else part of clang's if construct (i.e. Execution count of the else region = Execution count of the if's parent region - Execution count of the if's then region). Addition expressions represent constructs that fall through to other constructs, like multiple case statements in a switch that follow each other, or don't have a break statement. Expressions are distinct, so the frontend won't emit redundant expressions that duplicate each other. The coverage mapping library will provide a builder class that will fold the same expressions into one. * Mapping Region: The primary goal of a mapping region is to associate a source range with a counter. The source range includes a file id, the starting and ending locations (line and column), and the counter. A mapping region that is unreachable can be represented by a counter zero. The mapping regions are stored in a sorted order (by file id and starting location) to achieve more compact binary representation (discussed later). * Removed Region A removed region represents a region of code that doesn't have an execution count. It is needed so that the coverage tool can show that it's not executed. It includes a file id, and the starting and ending locations. An example of a removed region is a block of code that is removed by C's preprocessor in an #if or similar preprocessor constructs. For more details on why we need removed regions, please see the Clang Specific Details section. * File Name + Virtual File Mapping A mapping region is useless to a coverage tool unless we know what file it is in. As some frontends like clang can have mapping regions that come from different files but are in the same function, we need to store a file id with each mapping/removed/expansion region. The file id is an index into the virtual file mapping array. Thus the file id is a actually a virtual file id, as it might not correspond to a proper source code file, because the file ids can also represent a particular expansion of a C macro or an embedded include. The virtual file mapping array contains indices to the file name array that stores unique absolute file names of the program's source files. The file name array contains only the file names that have actual source code for this function. The file name and the virtual file mapping arrays are stored for each function. In order to achieve more compact representation the file name array might be implemented as a per TU/LLVM module array instead of a per function array, or it might somehow store references to the file names in the debug information instead of the strings containing the actual file names. The virtual file mapping array is optional, because when it represents an identity mapping from the file id's to the file names, it is not stored in the mapping data. * Expansion Region An expansion region represents an expansion of a virtual file (i.e. the use of a macro or an embedded textual include). It includes an expanded file id, a file id, and the starting and ending locations. It's primarily intended to be used for C's macros and embedded includes, but can be used by the languages which have similar features (e.g. Fortran with its INCLUDE statement). The expanded file id can used by the code coverage tool to gather and display the mapping regions which are produced for this particular expansion of a macro or an embedded include. Compact Binary Representation: ------------------------------ The size of the coverage mapping data is one of our biggest concerns. For that reason I propose the following compacting measures: * All integer values are stored as unsigned integers using DWARF's LEB128 encoding. This is a variable length encoding format, which is very space efficient for small numbers (1 byte for integers less than 128). * The counter is stored as one unsigned integer. The 2 lowest bits are taken for the counter tag, which distinguishes between counter references, addition expression references, subtraction expression references, and zero. The other bits are used for the counter index or the expression index. * The counter expressions are stored as just two counters. * Because the mapping region array is sorted by file id and by starting location, we can store the mapping region array as a sequence of subarrays without any file id information at all (file id is inferred from the index of the subarray in the sequence). * The source locations for the subarrays of mapping regions are stored in the following way: Starting line is stored as an unsigned integer representing the difference between the current starting line and the starting line of the previous mapping region. Ending line is stored as an unsigned integer representing the difference between the current ending line and the current starting line. Starting and ending columns are stored as unsigned integers. The mapping regions are sorted, thus most of the starting line deltas will be hopefully small enough (<=127) to fit in one byte. * The virtual file mapping isn't stored if the mapping represents an identity mapping from the file id's to the filenames. Clang Specific Details: ----------------------- Clang doesn't emit and therefore doesn't instrument some functions - in particular the unused static functions and the unused class methods defined inside class definition or defined outside of the class definition with the inline attribute. For those function, we can emit a special coverage mapping information which will contain a mapping region containing the full source range of the function and the counter zero. It might be also possible to use the same approach for the template functions without any specializations. Clang will try to minimize the size of the coverage data as much as possible. For this reason, instead of producing a mapping region for each statement/expression, clang will produce nested regions which correspond to the starting and ending locations of source constructs like the body of a function or the body of a loop. For this reasons, the coverage tool will show an execution counter even for empty/comment lines contained inside those regions. This is also the reason why clang needs the removed regions, because without them, the nested regions containing the removed blocks of code will tell the coverage tool that this code was actually executed. Summary of the LLVM Coverage Tool: ---------------------------------- At the moment llvm-cov is intended to work like gcov. With the new approach to code coverage, a command line flag or an instruction (the first command line argument) will have to be added to distinguish between the code coverage with the new approach and the old one. Alternatively, a new tool can be created. The coverage tool will require to know the name of the object file and the name of the file with the counter values in order to show code coverage details for a portion of the program's source code. Those two names will typically be passed as command line arguments. The coverage tool will have several ways to determine for which portion of the program's source code should code coverage details be displayed. The user of the tool can supply the source code filenames so that it will show the code coverage statistics for those files. If no filenames are supplied, the tool will show code coverage details for all the known source code files in that program. Alternatively, a user can supply a name of a function or a regular expression that will display code coverage for the matching function(s) only. The tool will display code coverage details using several approaches which can be combined as desired: * By default, a per-line approach will be used, where each line that is covered by a mapping region will display the number of times that this line was executed. In the cases where there are multiple mapping regions on one line, the execution count of the last one will be displayed. * Alternatively, instead of showing per-line execution counts, each line that has a start of a mapping region will show the number of times that this region was executed underneath this line. This will allow us to show execution counts for multiple mapping regions on the same line. * The regions of code that have the execution count of zero will be highlighted with a red background colour. * The expansion regions can be expanded to show the coverage details for the corresponding source code inside macros/embedded includes. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140616/66f06dba/attachment.html>