Xinliang David Li via llvm-dev
2020-Jul-05 20:43 UTC
[llvm-dev] RFC: Sanitizer-based Heap Profiler
On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote:

> This sounds very useful. We’ve improved and used memoro
> <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling and
> analysis, and we are also looking for ways to leverage memory profiles
> for PGO/FDO. I think having a common profiling infrastructure for
> analysis tooling as well as profile guided optimizations is a good
> design, and having it in LLVM is also helpful. Very interested in the
> tooling and optimizations that come after the profiler.
>
> Two questions:
>
> - How does the profiling overhead look? Is it similar to the ASAN
>   overhead you’ve seen, which would be higher than PGO instrumentation?
>   Asking because I’m wondering if an existing PGO training setup can be
>   used directly for the new heap profiling.

It is built on top of the ASAN runtime, but the overhead can be made much
lower by using counter update consolidation -- all fields sharing the same
shadow counter can be merged, and aggressive loop sinking/hoisting can be
done.

The goal is to integrate this with the PGO instrumentation. The PGO
instrumentation overhead can be further reduced with a sampling technique
(Rong Xu has a patch to be submitted).

> - I’m not familiar with how the sanitizers handle stack traces, but for
>   getting the most accurate calling context (using FP rather than
>   DWARF), I guess frame pointer omission and tail call optimization etc.
>   need to be turned off? Is that going to be implied by -fheapprof?

Kostya can provide detailed answers to these questions.

David

> Thanks,
>
> Wenlei
>
> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa
> Johnson via llvm-dev <llvm-dev at lists.llvm.org>
> *Reply-To: *Teresa Johnson <tejohnson at google.com>
> *Date: *Wednesday, June 24, 2020 at 4:58 PM
> *To: *llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany
> <kcc at google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka
> <vitalybuka at google.com>
> *Cc: *David Li <davidxl at google.com>
> *Subject: *[llvm-dev] RFC: Sanitizer-based Heap Profiler
>
> Hi all,
>
> I've included an RFC for a heap profiler design I've been working on in
> conjunction with David Li. Please send any questions or feedback. For
> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
> shadow setup code (see the Shadow Memory section).
>
> Thanks,
>
> Teresa
>
> RFC: Sanitizer-based Heap Profiler
>
> Summary
>
> This document provides an overview of an LLVM Sanitizer-based heap
> profiler design.
>
> Motivation
>
> The objective of heap memory profiling is to collect critical runtime
> information associated with heap memory references, along with
> information on heap memory allocations. The profile information will be
> used first for tooling, and subsequently to guide the compiler optimizer
> and allocation runtime to lay out heap objects with improved spatial
> locality. As a result, DTLB and cache utilization will improve, and
> program IPC (performance) will increase due to reduced TLB and cache
> misses. More details on the heap profile guided optimizations will be
> shared in the future.
>
> Overview
>
> The profiler is based on compiler-inserted instrumentation of load and
> store accesses, and utilizes runtime support to monitor heap allocations
> and gather profile data. The target consumer of the heap memory profile
> information is initially tooling, and ultimately automatic data layout
> optimizations performed by the compiler and/or allocation runtime (with
> the support of new allocation runtime APIs).
>
> Each memory address is mapped to Shadow Memory
> <https://en.wikipedia.org/wiki/Shadow_memory>, similar to the approach
> used by the Address Sanitizer
> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
> location implements the profile counter (incremented on accesses to the
> corresponding memory). This granularity was chosen to help avoid counter
> overflow, but it may be possible to consider mapping 32 bytes to 4 bytes.
> To avoid aliasing of shadow memory for different allocations, we must
> choose a minimum alignment carefully. As discussed further below, we can
> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
> storing the necessary heap information for each allocation in a 32-byte
> header block.
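> As a concrete illustration, the mapping is a mask, a shift, and an add (a
> sketch only -- the function and parameter names here are made up for
> exposition, not the prototype's actual code):
>
>   #include <cstdint>
>
>   // Map an application address to its 8-byte shadow counter under the
>   // 64:8 scaling. ShadowOffset is the dynamically chosen shadow base.
>   inline uint64_t *MemToShadow(uintptr_t Addr, uintptr_t ShadowOffset) {
>     uintptr_t Block = Addr & ~uintptr_t(0x3F); // start of 64-byte granule
>     return reinterpret_cast<uint64_t *>((Block >> 3) + ShadowOffset);
>   }
>
> Note that with a 32-byte minimum alignment and a 32-byte header, when two
> allocations do share a 64-byte granule, one half of the granule is always
> a header rather than user data, so no shadow counter is shared between
> two user allocations.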
> The compiler instruments each load and store to increment the associated
> shadow memory counter, in order to determine hotness.
>
> The heap profiler runtime is responsible for tracking allocations and
> deallocations, including the stack at each allocation, and information
> such as the allocation size and other statistics. I have implemented a
> prototype built using a stripped down and modified version of ASAN;
> however, this will be a separate library utilizing sanitizer_common
> components.
>
> Compiler
>
> A simple HeapProfiler instrumentation pass instruments interesting memory
> accesses (loads, stores, atomics) with a simple load, increment, store of
> the associated shadow memory location (computed via a mask and shift to
> map 64 bytes to 8 bytes of shadow, plus an add of the shadow offset). The
> handling is very similar to, and based off of, the ASAN instrumentation
> pass, with slightly different instrumentation code.
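> In C++-equivalent terms, the sequence inserted for each instrumented
> access amounts to the following (a sketch reusing the illustrative
> MemToShadow above; the pass itself emits the corresponding IR inline):
>
>   // Bump the shadow counter associated with the accessed address.
>   inline void RecordAccess(uintptr_t Addr, uintptr_t ShadowOffset) {
>     uint64_t *Counter = MemToShadow(Addr, ShadowOffset);
>     *Counter += 1; // load, increment, store
>   }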
> Various techniques can be used to reduce the overhead by aggressively
> coalescing counter updates (e.g. for accesses known to be in the same
> 32-byte block, given the 32-byte alignment, or even across possible
> aliases, since we don’t care about the dereferenced values).
>
> Additionally, the Clang driver needs to be set up to link with the
> runtime library, much as it does with the sanitizers.
>
> A -fheapprof option is added to enable the instrumentation pass and
> runtime library linking. Similar to -fprofile-generate, -fheapprof will
> accept an argument specifying the directory in which to write the
> profile.
>
> Runtime
>
> The heap profiler runtime is responsible for tracking and reporting
> information about heap allocations and accesses, aggregated by allocation
> calling context: for example, the hotness, lifetime, and cpu affinity.
>
> A new heapprof library will be created within compiler-rt. It will
> leverage support within sanitizer_common, which already contains
> facilities needed by the heap profiler, like stack context tracking.
>
> Shadow Memory
>
> There are some basic facilities in sanitizer_common for mmap’ing the
> shadow memory, but most of the existing setup lives in the ASAN and
> HWASAN libraries. In the case of ASAN, there is support for both
> statically assigned shadow offsets (the default on most platforms) and
> dynamically assigned shadow memory (implemented for Windows and currently
> also used for Android and iOS). According to kcc, recent experiments show
> that the performance with a dynamic shadow is close to that with a static
> mapping; in fact, a dynamic shadow is the only approach currently used by
> HWASAN. Given the simplicity, the heap profiler will be implemented with
> a dynamic shadow as well.
>
> There are a number of functions in ASAN and HWASAN related to setup of
> the shadow that are duplicated but very nearly identical, at least for
> Linux (which seems to be the only OS flavor currently supported for
> HWASAN): e.g. ReserveShadowMemoryRange, ProtectGap, and
> FindDynamicShadowStart (in ASAN there is another nearly identical copy in
> PremapShadow, used by Android, whereas in HWASAN the premap handling is
> already commoned with the non-premap handling). Rather than make yet
> another copy of these mechanisms, I propose refactoring them into
> sanitizer_common versions. Like HWASAN, the initial version of the heap
> profiler will be supported for Linux only, but other OSes can be added as
> needed, similar to ASAN.
>
> StackTrace and StackDepot
>
> The sanitizers already contain support for obtaining and representing a
> stack trace in a StackTrace object, and for storing it in the StackDepot,
> which “efficiently stores huge amounts of stack traces”. This is in the
> sanitizer_common subdirectory and the support is shared by ASAN and
> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
> where each StackTrace is assigned a unique id. ASAN stores this id in the
> alloc_context_id field in each ChunkHeader (in the redzone preceding each
> allocation). Additionally, there is support for symbolizing and printing
> StackTrace objects.
>
> ChunkHeader
>
> The heap profiler needs to track several pieces of information for each
> allocation. Given the mapping of 64 bytes to 8 bytes of shadow, we can
> achieve a minimum of 32-byte alignment by holding this information in a
> 32-byte header block preceding each allocation.
>
> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It
> contains information about the current allocation state, the user
> requested size, allocation and free thread ids, the allocation context id
> (representing the call stack at allocation, assigned by the StackDepot as
> described above), and miscellaneous other bookkeeping. For heap
> profiling, this will be converted to a 32-byte header block.
>
> Note that we could instead use the metadata section, similar to other
> sanitizers, which is stored in a separate location. However, as described
> above, storing the header block with each allocation enables 32-byte
> alignment without aliasing shadow counters for the same 64 bytes of
> memory.
>
> In the prototype heap profiler implementation, the header contains the
> following fields:
>
>   // Should be 32 bytes
>   struct ChunkHeader {
>     // 1-st 4 bytes
>     // Carry over from ASAN (available, allocated, quarantined). Will be
>     // reduced to 1 bit (available or allocated).
>     u32 chunk_state : 8;
>     // Carry over from ASAN. Used to determine the start of the user
>     // allocation.
>     u32 from_memalign : 1;
>     // 23 bits available
>
>     // 2-nd 4 bytes
>     // Carry over from ASAN (comment copied verbatim).
>     // This field is used for small sizes. For large sizes it is equal to
>     // SizeClassMap::kMaxSize and the actual size is stored in the
>     // SecondaryAllocator's metadata.
>     u32 user_requested_size : 29;
>
>     // 3-rd 4 bytes
>     u32 cpu_id; // Allocation cpu id
>
>     // 4-th 4 bytes
>     // Allocation timestamp in ms from a baseline timestamp computed at
>     // the start of profiling (to keep this within 32 bits).
>     u32 timestamp_ms;
>
>     // 5-th and 6-th 4 bytes
>     // Carry over from ASAN. Used to identify the allocation stack trace.
>     u64 alloc_context_id;
>
>     // 7-th and 8-th 4 bytes
>     // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>     u64 data_type_id; // hash of type name
>   };
>
> As noted, the chunk state can be reduced to a single bit (there is no
> need for quarantined memory in the heap profiler). The header contains a
> placeholder for the data type hash, which is not yet implemented as it
> needs instrumentation and IR support.
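> Since the 32-byte alignment argument depends on the header being exactly
> half a 64-byte shadow granule, a compile-time size check is cheap
> insurance (an illustrative addition, not in the prototype):
>
>   static_assert(sizeof(ChunkHeader) == 32,
>                 "header must stay exactly half a 64-byte shadow granule");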
> Heap Info Block (HIB)
>
> On a deallocation, information from the corresponding shadow block(s) and
> header is recorded in a Heap Info Block (HIB) object. The access count is
> computed from the shadow memory locations for the allocation, as well as
> the percentage of accessed 64-byte blocks (i.e. the percentage of
> non-zero 8-byte shadow locations for the whole allocation). Other
> information, such as the deallocation timestamp (for lifetime
> computation) and the deallocation cpu id (to determine migrations), is
> recorded along with the information in the chunk header recorded at
> allocation.
>
> The prototyped HIB object tracks the following:
>
>   struct HeapInfoBlock {
>     // Total allocations at this stack context
>     u32 alloc_count;
>     // Access count computed from all allocated 64-byte blocks (track the
>     // total across all allocations, and the min and max).
>     u64 total_access_count, min_access_count, max_access_count;
>     // Allocated size (track the total across all allocations, and the
>     // min and max).
>     u64 total_size;
>     u32 min_size, max_size;
>     // Lifetime (track the total across all allocations, and the min and
>     // max).
>     u64 total_lifetime;
>     u32 min_lifetime, max_lifetime;
>     // Percent utilization of allocated 64-byte blocks (track the total
>     // across all allocations, and the min and max). The utilization is
>     // defined as the percentage of 8-byte shadow counters corresponding
>     // to the full allocation that are non-zero.
>     u64 total_percent_utilized;
>     u32 min_percent_utilized, max_percent_utilized;
>     // Allocation and deallocation timestamps from the most recent merge
>     // into the table with this stack context.
>     u32 alloc_timestamp, dealloc_timestamp;
>     // Allocation and deallocation cpu ids from the most recent merge
>     // into the table with this stack context.
>     u32 alloc_cpu_id, dealloc_cpu_id;
>     // Count of allocations at this stack context that had a different
>     // allocation and deallocation cpu id.
>     u32 num_migrated_cpu;
>     // Number of times the entry being merged had its lifetime overlap
>     // with the previous entry merged with this stack context (by
>     // comparing the new alloc/dealloc timestamps with the ones last
>     // recorded in the entry in the table).
>     u32 num_lifetime_overlaps;
>     // Number of times the alloc/dealloc cpu of the entry being merged
>     // was the same as that of the previous entry merged with this stack
>     // context.
>     u32 num_same_alloc_cpu;
>     u32 num_same_dealloc_cpu;
>     // Hash of type name (UNIMPLEMENTED). This needs instrumentation
>     // support and possibly IR changes.
>     u64 data_type_id;
>   };
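> For reference, computing the access count and utilization from the shadow
> on deallocation looks roughly like this (a sketch reusing the
> illustrative MemToShadow from earlier; the prototype's exact code may
> differ):
>
>   // Walk the 8-byte shadow counters covering [Beg, Beg + Size), with
>   // Size > 0, and compute the total access count plus the percentage of
>   // 64-byte granules that were ever touched (non-zero counter).
>   inline void ComputeAccessStats(uintptr_t Beg, uintptr_t Size,
>                                  uintptr_t ShadowOffset,
>                                  uint64_t &AccessCount,
>                                  uint32_t &PercentUtilized) {
>     uint64_t *First = MemToShadow(Beg, ShadowOffset);
>     uint64_t *Last = MemToShadow(Beg + Size - 1, ShadowOffset);
>     uint64_t Touched = 0, Blocks = 0;
>     AccessCount = 0;
>     for (uint64_t *P = First; P <= Last; ++P, ++Blocks) {
>       AccessCount += *P;
>       if (*P != 0)
>         ++Touched;
>     }
>     PercentUtilized = Blocks ? (uint32_t)(100 * Touched / Blocks) : 0;
>   }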
> HIB Table
>
> The Heap Info Block Table, which is a multi-way associative cache, holds
> HIB objects for deallocated objects. It is indexed by the stack
> allocation context id from the chunk header, and currently utilizes a
> simple mod with a prime number close to a power of two as the hash
> (because of the way the stack context ids are assigned, a mod by a power
> of two performs very poorly). Thus far, only 4-way associativity has been
> evaluated.
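> Concretely, the lookup amounts to something like the following (a sketch
> with illustrative constants, names, and table layout; the prototype's
> exact parameters may differ):
>
>   static const uint64_t kHibTableSize = 1021; // prime near 2^10
>   static const unsigned kHibWays = 4;         // associativity
>
>   // Return the matching way within the set for this stack allocation
>   // context id, or nullptr if a new entry must be inserted (possibly
>   // evicting, and immediately dumping, a victim from the set).
>   inline HeapInfoBlock *HibLookup(HeapInfoBlock *Table, const uint64_t *Ids,
>                                   uint64_t AllocContextId) {
>     uint64_t Set = (AllocContextId % kHibTableSize) * kHibWays;
>     for (unsigned Way = 0; Way < kHibWays; ++Way)
>       if (Ids[Set + Way] == AllocContextId)
>         return &Table[Set + Way]; // merge into the existing entry
>     return nullptr;
>   }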
> HIB entries are added or merged into the HIB Table on each deallocation.
> If an entry with a matching stack alloc context id is found in the Table,
> the newly deallocated information is merged into the existing entry. Each
> HIB Table entry currently tracks the min, max and total value of the
> various fields, for use in computing and reporting the min, max and
> average when the Table is ultimately dumped.
>
> If no entry with a matching stack alloc context id is found, a new entry
> is created. If this causes an eviction, the evicted entry is dumped
> immediately (by default to stderr, otherwise to a specified report file).
> Later post-processing can merge dumped entries with the same stack alloc
> context id.
>
> Initialization
>
> For ASAN, an __asan_init function initializes the memory allocation
> tracking support, and the ASAN instrumentation pass in LLVM creates a
> global constructor to invoke it. The heap profiler prototype adds a new
> __heapprof_init function, which performs heap profile specific
> initialization, and the heap profile instrumentation pass calls this new
> init function instead, via a generated global constructor. It currently
> additionally invokes __asan_init, since we are leveraging a modified ASAN
> runtime. Eventually, this should be changed to initialize the refactored
> common support.
>
> Note that __asan_init is also placed in the .preinit_array when that is
> available, so it is invoked even earlier than global constructors.
> Currently, it is not possible to do this for __heapprof_init, as it calls
> timespec_get in order to get a baseline timestamp (as described in the
> ChunkHeader comments, the timestamps (ms) are actually offsets from the
> baseline timestamp, in order to fit into 32 bits), and system calls
> cannot be made that early (dl_init is not complete). Since the
> constructor priority is 1, it should be executed early enough that there
> are very few allocations before it runs, and likely the best solution is
> to simply ignore any allocations before initialization.
>
> Dumping
>
> For the prototype, the profile is dumped as text in a compact raw format
> to limit its size. Ultimately it should be dumped in a more compact
> binary format (i.e. into a different section of the raw instrumentation
> based profile, with llvm-profdata performing post-processing), which is
> TBD.
>
> HIB Dumping
>
> As noted earlier, HIB Table entries are created as memory is deallocated.
> At the end of the run (or whenever dumping is requested, discussed
> later), HIB entries need to be created for allocations that are still
> live. Conveniently, the sanitizer allocator already contains a mechanism
> to walk through all chunks of memory it is tracking (ForEachChunk). The
> heap profiler simply looks for all chunks with a chunk state of
> allocated, and creates a HIB just as would be done on deallocation,
> adding each to the table.
>
> A HIB Table mechanism for printing each entry is then invoked.
>
> By default, dumping occurs:
>
> - on evictions
> - for the full table at exit (when the static Allocator object is
>   destructed)
>
> For running in a load testing scenario, we will want to add a mechanism
> to provoke finalization (merging currently live allocations) and dumping
> of the HIB Table before exit. This would be similar to the
> __llvm_profile_dump facility used for normal PGO counter dumping.
>
> Stack Trace Dumping
>
> There is existing support for dumping symbolized StackTrace objects. A
> wrapper to dump all StackTrace objects in the StackDepot will be added.
> This new interface is invoked just after the HIB Table is dumped (on exit
> or via the dumping interface).
>
> Memory Map Dumping
>
> In cases where we may want to symbolize as a post-processing step, we may
> need the memory map (from /proc/self/smaps). Specifically, this is needed
> to symbolize binaries using ASLR (Address Space Layout Randomization).
> There is already support for reading this file and dumping it to the
> specified report output file (DumpProcessMap()). This is invoked when the
> profile output file is initialized (HIB Table construction), so that the
> memory map is available at the top of the raw profile.
>
> Current Status and Next Steps
>
> As mentioned earlier, I have a working prototype based on a simplified,
> stripped down version of ASAN. My current plan is to do the following:
>
> 1. Refactor out some of the shadow setup code common between ASAN and
>    HWASAN into sanitizer_common.
> 2. Rework my prototype into a separate heapprof library in compiler-rt,
>    using sanitizer_common support where possible, and send patches for
>    review.
> 3. Send patches for the heap profiler instrumentation pass and related
>    clang options.
> 4. Design/implement the binary profile format.
>
> --
> Teresa Johnson | Software Engineer | tejohnson at google.com |
Teresa Johnson via llvm-dev
2020-Jul-06 14:58 UTC
[llvm-dev] RFC: Sanitizer-based Heap Profiler
Hi Wenlei,

Thanks for the comments! David answered the first question; I do have some
comments on the second one though.

Teresa

On Sun, Jul 5, 2020 at 1:44 PM Xinliang David Li <davidxl at google.com> wrote:

>> - I’m not familiar with how the sanitizers handle stack traces, but for
>>   getting the most accurate calling context (using FP rather than
>>   DWARF), I guess frame pointer omission and tail call optimization
>>   etc. need to be turned off? Is that going to be implied by -fheapprof?
>
> Kostya can provide detailed answers to these questions.

I'm not aware that the -fsanitize* options disable these, but I know in
our environment we do disable frame pointer omission when setting up ASAN
builds, and I am arranging for heap profiling builds to do the same. I'm
not sure whether we want to do this within clang itself; I would be
interested in Kostya's opinion. I can't see anywhere that we disable tail
call optimizations for ASAN though, but I might have missed it.

Thanks,
Teresa
--
Teresa Johnson | Software Engineer | tejohnson at google.com |
Mitch Phillips via llvm-dev
2020-Jul-06 18:48 UTC
[llvm-dev] RFC: Sanitizer-based Heap Profiler
> I'm not aware that the -fsanitize* options disable these, but I know in
> our environment we do disable frame pointer omission when setting up ASAN
> builds, and I am arranging for heap profiling builds to do the same. I'm
> not sure whether we want to do this within clang itself; I would be
> interested in Kostya's opinion. I can't see anywhere that we disable tail
> call optimizations for ASAN though, but I might have missed it.

We don't force frame pointers to be emitted with -fsanitize=address at
least -- although we highly recommend it
<https://clang.llvm.org/docs/AddressSanitizer.html#usage>, as the frame
pointer unwinder is much faster than DWARF, which is particularly
important for stack collection on malloc/free.