Xinliang David Li via llvm-dev
2020-Jul-05 20:43 UTC
[llvm-dev] RFC: Sanitizer-based Heap Profiler
On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote:

> This sounds very useful. We’ve improved and used memoro
> <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling and
> analysis, and we are also looking for ways to leverage memory profiles
> for PGO/FDO. I think having a common profiling infrastructure for
> analysis tooling as well as profile guided optimizations is a good
> design, and having it in LLVM is also helpful. Very interested in the
> tooling and optimizations that come after the profiler.
>
> Two questions:
>
> - How does the profiling overhead look? Is it similar to the ASAN
>   overhead you’ve seen, which would be higher than PGO instrumentation?
>   Asking because I’m wondering if an existing PGO training setup can be
>   used directly for the new heap profiling.

It is built on top of the ASAN runtime, but the overhead can be made much
lower by using counter update consolidation -- all fields sharing the same
shadow counter can be merged, and aggressive loop sinking/hoisting can be
done.

The goal is to integrate this with the PGO instrumentation. The PGO
instrumentation overhead can be further reduced with a sampling technique
(Rong Xu has a patch to be submitted).

> - I’m not familiar with how the sanitizers handle stack traces, but for
>   getting the most accurate calling context (using FP rather than
>   DWARF), I guess frame pointer omission and tail call optimization etc.
>   need to be turned off? Is that going to be implied by -fheapprof?

Kostya can provide detailed answers to these questions.

David

> Thanks,
>
> Wenlei
>
> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa
> Johnson via llvm-dev <llvm-dev at lists.llvm.org>
> *Reply-To: *Teresa Johnson <tejohnson at google.com>
> *Date: *Wednesday, June 24, 2020 at 4:58 PM
> *To: *llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany
> <kcc at google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka
> <vitalybuka at google.com>
> *Cc: *David Li <davidxl at google.com>
> *Subject: *[llvm-dev] RFC: Sanitizer-based Heap Profiler
>
> Hi all,
>
> I've included an RFC for a heap profiler design I've been working on in
> conjunction with David Li. Please send any questions or feedback. For
> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
> shadow setup code (see the Shadow Memory section).
>
> Thanks,
>
> Teresa
>
> RFC: Sanitizer-based Heap Profiler
>
> Summary
>
> This document provides an overview of an LLVM Sanitizer-based heap
> profiler design.
>
> Motivation
>
> The objective of heap memory profiling is to collect critical runtime
> information associated with heap memory references, along with
> information on heap memory allocations. The profile information will be
> used first for tooling, and subsequently to guide the compiler optimizer
> and allocation runtime to lay out heap objects with improved spatial
> locality. As a result, DTLB and cache utilization will improve, and
> program IPC (performance) will increase due to reduced TLB and cache
> misses. More details on the heap profile guided optimizations will be
> shared in the future.
>
> Overview
>
> The profiler is based on compiler-inserted instrumentation of load and
> store accesses, and utilizes runtime support to monitor heap allocations
> and gather profile data. The target consumer of the heap memory profile
> information is initially tooling, and ultimately automatic data layout
> optimizations performed by the compiler and/or allocation runtime (with
> the support of new allocation runtime APIs).
>
> Each memory address is mapped to Shadow Memory
> <https://en.wikipedia.org/wiki/Shadow_memory>, similar to the approach
> used by the Address Sanitizer
> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
> location implements the profile counter (incremented on accesses to the
> corresponding memory). This granularity was chosen to help avoid counter
> overflow, but it may be possible to consider mapping 32 bytes to 4 bytes.
> To avoid aliasing of shadow memory for different allocations, we must
> choose a minimum alignment carefully. As discussed further below, we can
> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
> storing the necessary heap information for each allocation in a 32-byte
> header block.
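> As a concrete illustration, the mapping is a mask, a shift, and an add (a
> sketch only -- the function and parameter names here are made up for
> exposition, not the prototype's actual code):
>
>   #include <cstdint>
>
>   // Map an application address to its 8-byte shadow counter under the
>   // 64:8 scaling. ShadowOffset is the dynamically chosen shadow base.
>   inline uint64_t *MemToShadow(uintptr_t Addr, uintptr_t ShadowOffset) {
>     uintptr_t Block = Addr & ~uintptr_t(0x3F); // start of 64-byte granule
>     return reinterpret_cast<uint64_t *>((Block >> 3) + ShadowOffset);
>   }
>
> Note that with a 32-byte minimum alignment and a 32-byte header, when two
> allocations do share a 64-byte granule, one half of the granule is always
> a header rather than user data, so no shadow counter is shared between
> two user allocations.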
> The compiler instruments each load and store to increment the associated
> shadow memory counter, in order to determine hotness.
>
> The heap profiler runtime is responsible for tracking allocations and
> deallocations, including the stack at each allocation, and information
> such as the allocation size and other statistics. I have implemented a
> prototype built using a stripped down and modified version of ASAN;
> however, this will be a separate library utilizing sanitizer_common
> components.
>
> Compiler
>
> A simple HeapProfiler instrumentation pass instruments interesting memory
> accesses (loads, stores, atomics) with a simple load, increment, store of
> the associated shadow memory location (computed via a mask and shift to
> map 64 bytes to 8 bytes of shadow, plus an add of the shadow offset). The
> handling is very similar to, and based off of, the ASAN instrumentation
> pass, with slightly different instrumentation code.
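> In C++-equivalent terms, the sequence inserted for each instrumented
> access amounts to the following (a sketch reusing the illustrative
> MemToShadow above; the pass itself emits the corresponding IR inline):
>
>   // Bump the shadow counter associated with the accessed address.
>   inline void RecordAccess(uintptr_t Addr, uintptr_t ShadowOffset) {
>     uint64_t *Counter = MemToShadow(Addr, ShadowOffset);
>     *Counter += 1; // load, increment, store
>   }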
> Various techniques can be used to reduce the overhead by aggressively
> coalescing counter updates (e.g. for accesses known to be in the same
> 32-byte block, given the 32-byte alignment, or even across possible
> aliases, since we don’t care about the dereferenced values).
>
> Additionally, the Clang driver needs to be set up to link with the
> runtime library, much as it does with the sanitizers.
>
> A -fheapprof option is added to enable the instrumentation pass and
> runtime library linking. Similar to -fprofile-generate, -fheapprof will
> accept an argument specifying the directory in which to write the
> profile.
>
> Runtime
>
> The heap profiler runtime is responsible for tracking and reporting
> information about heap allocations and accesses, aggregated by allocation
> calling context: for example, the hotness, lifetime, and cpu affinity.
>
> A new heapprof library will be created within compiler-rt. It will
> leverage support within sanitizer_common, which already contains
> facilities needed by the heap profiler, like stack context tracking.
>
> Shadow Memory
>
> There are some basic facilities in sanitizer_common for mmap’ing the
> shadow memory, but most of the existing setup lives in the ASAN and
> HWASAN libraries. In the case of ASAN, there is support for both
> statically assigned shadow offsets (the default on most platforms) and
> dynamically assigned shadow memory (implemented for Windows and currently
> also used for Android and iOS). According to kcc, recent experiments show
> that the performance with a dynamic shadow is close to that with a static
> mapping; in fact, a dynamic shadow is the only approach currently used by
> HWASAN. Given the simplicity, the heap profiler will be implemented with
> a dynamic shadow as well.
>
> There are a number of functions in ASAN and HWASAN related to setup of
> the shadow that are duplicated but very nearly identical, at least for
> Linux (which seems to be the only OS flavor currently supported for
> HWASAN): e.g. ReserveShadowMemoryRange, ProtectGap, and
> FindDynamicShadowStart (in ASAN there is another nearly identical copy in
> PremapShadow, used by Android, whereas in HWASAN the premap handling is
> already commoned with the non-premap handling). Rather than make yet
> another copy of these mechanisms, I propose refactoring them into
> sanitizer_common versions. Like HWASAN, the initial version of the heap
> profiler will be supported for Linux only, but other OSes can be added as
> needed, similar to ASAN.
>
> StackTrace and StackDepot
>
> The sanitizers already contain support for obtaining and representing a
> stack trace in a StackTrace object, and for storing it in the StackDepot,
> which “efficiently stores huge amounts of stack traces”. This is in the
> sanitizer_common subdirectory and the support is shared by ASAN and
> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
> where each StackTrace is assigned a unique id. ASAN stores this id in the
> alloc_context_id field in each ChunkHeader (in the redzone preceding each
> allocation). Additionally, there is support for symbolizing and printing
> StackTrace objects.
>
> ChunkHeader
>
> The heap profiler needs to track several pieces of information for each
> allocation. Given the mapping of 64 bytes to 8 bytes of shadow, we can
> achieve a minimum of 32-byte alignment by holding this information in a
> 32-byte header block preceding each allocation.
>
> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It
> contains information about the current allocation state, the user
> requested size, allocation and free thread ids, the allocation context id
> (representing the call stack at allocation, assigned by the StackDepot as
> described above), and miscellaneous other bookkeeping. For heap
> profiling, this will be converted to a 32-byte header block.
>
> Note that we could instead use the metadata section, similar to other
> sanitizers, which is stored in a separate location. However, as described
> above, storing the header block with each allocation enables 32-byte
> alignment without aliasing shadow counters for the same 64 bytes of
> memory.
>
> In the prototype heap profiler implementation, the header contains the
> following fields:
>
>   // Should be 32 bytes
>   struct ChunkHeader {
>     // 1-st 4 bytes
>     // Carry over from ASAN (available, allocated, quarantined). Will be
>     // reduced to 1 bit (available or allocated).
>     u32 chunk_state : 8;
>     // Carry over from ASAN. Used to determine the start of the user
>     // allocation.
>     u32 from_memalign : 1;
>     // 23 bits available
>
>     // 2-nd 4 bytes
>     // Carry over from ASAN (comment copied verbatim).
>     // This field is used for small sizes. For large sizes it is equal to
>     // SizeClassMap::kMaxSize and the actual size is stored in the
>     // SecondaryAllocator's metadata.
>     u32 user_requested_size : 29;
>
>     // 3-rd 4 bytes
>     u32 cpu_id; // Allocation cpu id
>
>     // 4-th 4 bytes
>     // Allocation timestamp in ms from a baseline timestamp computed at
>     // the start of profiling (to keep this within 32 bits).
>     u32 timestamp_ms;
>
>     // 5-th and 6-th 4 bytes
>     // Carry over from ASAN. Used to identify the allocation stack trace.
>     u64 alloc_context_id;
>
>     // 7-th and 8-th 4 bytes
>     // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>     u64 data_type_id; // hash of type name
>   };
>
> As noted, the chunk state can be reduced to a single bit (there is no
> need for quarantined memory in the heap profiler). The header contains a
> placeholder for the data type hash, which is not yet implemented as it
> needs instrumentation and IR support.
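> Since the 32-byte alignment argument depends on the header being exactly
> half a 64-byte shadow granule, a compile-time size check is cheap
> insurance (an illustrative addition, not in the prototype):
>
>   static_assert(sizeof(ChunkHeader) == 32,
>                 "header must stay exactly half a 64-byte shadow granule");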
> Heap Info Block (HIB)
>
> On a deallocation, information from the corresponding shadow block(s) and
> header is recorded in a Heap Info Block (HIB) object. The access count is
> computed from the shadow memory locations for the allocation, as well as
> the percentage of accessed 64-byte blocks (i.e. the percentage of
> non-zero 8-byte shadow locations for the whole allocation). Other
> information, such as the deallocation timestamp (for lifetime
> computation) and the deallocation cpu id (to determine migrations), is
> recorded along with the information in the chunk header recorded at
> allocation.
>
> The prototyped HIB object tracks the following:
>
>   struct HeapInfoBlock {
>     // Total allocations at this stack context
>     u32 alloc_count;
>     // Access count computed from all allocated 64-byte blocks (track the
>     // total across all allocations, and the min and max).
>     u64 total_access_count, min_access_count, max_access_count;
>     // Allocated size (track the total across all allocations, and the
>     // min and max).
>     u64 total_size;
>     u32 min_size, max_size;
>     // Lifetime (track the total across all allocations, and the min and
>     // max).
>     u64 total_lifetime;
>     u32 min_lifetime, max_lifetime;
>     // Percent utilization of allocated 64-byte blocks (track the total
>     // across all allocations, and the min and max). The utilization is
>     // defined as the percentage of 8-byte shadow counters corresponding
>     // to the full allocation that are non-zero.
>     u64 total_percent_utilized;
>     u32 min_percent_utilized, max_percent_utilized;
>     // Allocation and deallocation timestamps from the most recent merge
>     // into the table with this stack context.
>     u32 alloc_timestamp, dealloc_timestamp;
>     // Allocation and deallocation cpu ids from the most recent merge
>     // into the table with this stack context.
>     u32 alloc_cpu_id, dealloc_cpu_id;
>     // Count of allocations at this stack context that had a different
>     // allocation and deallocation cpu id.
>     u32 num_migrated_cpu;
>     // Number of times the entry being merged had its lifetime overlap
>     // with the previous entry merged with this stack context (by
>     // comparing the new alloc/dealloc timestamps with the ones last
>     // recorded in the entry in the table).
>     u32 num_lifetime_overlaps;
>     // Number of times the alloc/dealloc cpu of the entry being merged
>     // was the same as that of the previous entry merged with this stack
>     // context.
>     u32 num_same_alloc_cpu;
>     u32 num_same_dealloc_cpu;
>     // Hash of type name (UNIMPLEMENTED). This needs instrumentation
>     // support and possibly IR changes.
>     u64 data_type_id;
>   };
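> For reference, computing the access count and utilization from the shadow
> on deallocation looks roughly like this (a sketch reusing the
> illustrative MemToShadow from earlier; the prototype's exact code may
> differ):
>
>   // Walk the 8-byte shadow counters covering [Beg, Beg + Size), with
>   // Size > 0, and compute the total access count plus the percentage of
>   // 64-byte granules that were ever touched (non-zero counter).
>   inline void ComputeAccessStats(uintptr_t Beg, uintptr_t Size,
>                                  uintptr_t ShadowOffset,
>                                  uint64_t &AccessCount,
>                                  uint32_t &PercentUtilized) {
>     uint64_t *First = MemToShadow(Beg, ShadowOffset);
>     uint64_t *Last = MemToShadow(Beg + Size - 1, ShadowOffset);
>     uint64_t Touched = 0, Blocks = 0;
>     AccessCount = 0;
>     for (uint64_t *P = First; P <= Last; ++P, ++Blocks) {
>       AccessCount += *P;
>       if (*P != 0)
>         ++Touched;
>     }
>     PercentUtilized = Blocks ? (uint32_t)(100 * Touched / Blocks) : 0;
>   }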
> HIB Table
>
> The Heap Info Block Table, which is a multi-way associative cache, holds
> HIB objects for deallocated objects. It is indexed by the stack
> allocation context id from the chunk header, and currently utilizes a
> simple mod with a prime number close to a power of two as the hash
> (because of the way the stack context ids are assigned, a mod by a power
> of two performs very poorly). Thus far, only 4-way associativity has been
> evaluated.
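> Concretely, the lookup amounts to something like the following (a sketch
> with illustrative constants, names, and table layout; the prototype's
> exact parameters may differ):
>
>   static const uint64_t kHibTableSize = 1021; // prime near 2^10
>   static const unsigned kHibWays = 4;         // associativity
>
>   // Return the matching way within the set for this stack allocation
>   // context id, or nullptr if a new entry must be inserted (possibly
>   // evicting, and immediately dumping, a victim from the set).
>   inline HeapInfoBlock *HibLookup(HeapInfoBlock *Table, const uint64_t *Ids,
>                                   uint64_t AllocContextId) {
>     uint64_t Set = (AllocContextId % kHibTableSize) * kHibWays;
>     for (unsigned Way = 0; Way < kHibWays; ++Way)
>       if (Ids[Set + Way] == AllocContextId)
>         return &Table[Set + Way]; // merge into the existing entry
>     return nullptr;
>   }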
> HIB entries are added or merged into the HIB Table on each deallocation.
> If an entry with a matching stack alloc context id is found in the Table,
> the newly deallocated information is merged into the existing entry. Each
> HIB Table entry currently tracks the min, max and total value of the
> various fields, for use in computing and reporting the min, max and
> average when the Table is ultimately dumped.
>
> If no entry with a matching stack alloc context id is found, a new entry
> is created. If this causes an eviction, the evicted entry is dumped
> immediately (by default to stderr, otherwise to a specified report file).
> Later post-processing can merge dumped entries with the same stack alloc
> context id.
>
> Initialization
>
> For ASAN, an __asan_init function initializes the memory allocation
> tracking support, and the ASAN instrumentation pass in LLVM creates a
> global constructor to invoke it. The heap profiler prototype adds a new
> __heapprof_init function, which performs heap profile specific
> initialization, and the heap profile instrumentation pass calls this new
> init function instead, via a generated global constructor. It currently
> additionally invokes __asan_init, since we are leveraging a modified ASAN
> runtime. Eventually, this should be changed to initialize the refactored
> common support.
>
> Note that __asan_init is also placed in the .preinit_array when that is
> available, so it is invoked even earlier than global constructors.
> Currently, it is not possible to do this for __heapprof_init, as it calls
> timespec_get in order to get a baseline timestamp (as described in the
> ChunkHeader comments, the timestamps (ms) are actually offsets from the
> baseline timestamp, in order to fit into 32 bits), and system calls
> cannot be made that early (dl_init is not complete). Since the
> constructor priority is 1, it should be executed early enough that there
> are very few allocations before it runs, and likely the best solution is
> to simply ignore any allocations before initialization.
>
> Dumping
>
> For the prototype, the profile is dumped as text in a compact raw format
> to limit its size. Ultimately it should be dumped in a more compact
> binary format (i.e. into a different section of the raw instrumentation
> based profile, with llvm-profdata performing post-processing), which is
> TBD.
>
> HIB Dumping
>
> As noted earlier, HIB Table entries are created as memory is deallocated.
> At the end of the run (or whenever dumping is requested, discussed
> later), HIB entries need to be created for allocations that are still
> live. Conveniently, the sanitizer allocator already contains a mechanism
> to walk through all chunks of memory it is tracking (ForEachChunk). The
> heap profiler simply looks for all chunks with a chunk state of
> allocated, and creates a HIB just as would be done on deallocation,
> adding each to the table.
>
> A HIB Table mechanism for printing each entry is then invoked.
>
> By default, dumping occurs:
>
> - on evictions
> - for the full table at exit (when the static Allocator object is
>   destructed)
>
> For running in a load testing scenario, we will want to add a mechanism
> to provoke finalization (merging currently live allocations) and dumping
> of the HIB Table before exit. This would be similar to the
> __llvm_profile_dump facility used for normal PGO counter dumping.
>
> Stack Trace Dumping
>
> There is existing support for dumping symbolized StackTrace objects. A
> wrapper to dump all StackTrace objects in the StackDepot will be added.
> This new interface is invoked just after the HIB Table is dumped (on exit
> or via the dumping interface).
>
> Memory Map Dumping
>
> In cases where we may want to symbolize as a post-processing step, we may
> need the memory map (from /proc/self/smaps). Specifically, this is needed
> to symbolize binaries using ASLR (Address Space Layout Randomization).
> There is already support for reading this file and dumping it to the
> specified report output file (DumpProcessMap()). This is invoked when the
> profile output file is initialized (HIB Table construction), so that the
> memory map is available at the top of the raw profile.
>
> Current Status and Next Steps
>
> As mentioned earlier, I have a working prototype based on a simplified,
> stripped down version of ASAN. My current plan is to do the following:
>
> 1. Refactor out some of the shadow setup code common between ASAN and
>    HWASAN into sanitizer_common.
> 2. Rework my prototype into a separate heapprof library in compiler-rt,
>    using sanitizer_common support where possible, and send patches for
>    review.
> 3. Send patches for the heap profiler instrumentation pass and related
>    clang options.
> 4. Design/implement the binary profile format.
>
> --
> Teresa Johnson | Software Engineer | tejohnson at google.com |
Teresa Johnson via llvm-dev
2020-Jul-06 14:58 UTC
[llvm-dev] RFC: Sanitizer-based Heap Profiler
Hi Wenlei,

Thanks for the comments! David answered the first question; I do have some
comments on the second one though.

Teresa

On Sun, Jul 5, 2020 at 1:44 PM Xinliang David Li <davidxl at google.com> wrote:

>> - I’m not familiar with how the sanitizers handle stack traces, but for
>>   getting the most accurate calling context (using FP rather than
>>   DWARF), I guess frame pointer omission and tail call optimization
>>   etc. need to be turned off? Is that going to be implied by -fheapprof?
>
> Kostya can provide detailed answers to these questions.

I'm not aware that the -fsanitize* options disable these, but I know in
our environment we do disable frame pointer omission when setting up ASAN
builds, and I am arranging for heap profiling builds to do the same. I'm
not sure whether we want to do this within clang itself; I would be
interested in Kostya's opinion. I can't see anywhere that we disable tail
call optimizations for ASAN though, but I might have missed it.

Thanks,
Teresa
--
Teresa Johnson | Software Engineer | tejohnson at google.com |
Mitch Phillips via llvm-dev
2020-Jul-06 18:48 UTC
[llvm-dev] RFC: Sanitizer-based Heap Profiler
> I'm not aware that the -fsanitize* options disable these, but I know in
> our environment we do disable frame pointer omission when setting up ASAN
> builds, and I am arranging for heap profiling builds to do the same. I'm
> not sure whether we want to do this within clang itself; I would be
> interested in Kostya's opinion. I can't see anywhere that we disable tail
> call optimizations for ASAN though, but I might have missed it.

We don't force frame pointers to be emitted with -fsanitize=address at
least -- although we highly recommend it
<https://clang.llvm.org/docs/AddressSanitizer.html#usage>, as the frame
pointer unwinder is much faster than DWARF, which is particularly
important for stack collection on malloc/free.