thr3ads.net - llvm dev - [llvm-dev] RFC: Sanitizer-based Heap Profiler [Jul 2020]

If this information is useful, please help other people find it:
Share via:

Teresa Johnson via llvm-dev

2020-Jun-24 23:57 UTC

[llvm-dev] RFC: Sanitizer-based Heap Profiler

Hi all,

I've included an RFC for a heap profiler design I've been working on in
conjunction with David Li. Please send any questions or feedback. For
sanitizer folks, one area of feedback is on refactoring some of the *ASAN
shadow setup code (see the Shadow Memory section).

Thanks,
Teresa


RFC: Sanitizer-based Heap Profiler
Summary

This document provides an overview of an LLVM Sanitizer-based heap profiler
design.
Motivation

The objective of heap memory profiling is to collect critical runtime
information associated with heap memory references and information on heap
memory allocations. The profile information will be used first for tooling,
and subsequently to guide the compiler optimizer and allocation runtime to
layout heap objects with improved spatial locality. As a  result, DTLB and
cache utilization will be improved, and program IPC (performance) will be
increased due to reduced TLB and cache misses. More details on the heap
profile guided optimizations will be shared in the future.
Overview

The profiler is based on compiler inserted instrumentation of load and
store accesses, and utilizes runtime support to monitor heap allocations
and profile data. The target consumer of the heap memory profile
information is initially tooling and ultimately automatic data layout
optimizations performed by the compiler and/or allocation runtime (with the
support of new allocation runtime APIs).

Each memory address is mapped to Shadow Memory
<https://en.wikipedia.org/wiki/Shadow_memory>, similar to the approach
used
by the Address Sanitizer
<https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
Unlike
ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the heap
profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow location
implements the profile counter (incremented on accesses to the
corresponding memory). This granularity was chosen to help avoid counter
overflow, but it may be possible to consider mapping 32-bytes to 4 bytes.
To avoid aliasing of shadow memory for different allocations, we must
choose a minimum alignment carefully. As discussed further below, we can
attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
storing necessary heap information for each allocation in a 32-byte header
block.

The compiler instruments each load and store to increment the associated
shadow memory counter, in order to determine hotness.

The heap profiler runtime is responsible for tracking allocations and
deallocations, including the stack at each allocation, and information such
as the allocation size and other statistics. I have implemented a prototype
built using a stripped down and modified version of ASAN, however this will
be a separate library utilizing sanitizer_common components.
Compiler

A simple HeapProfiler instrumentation pass instruments interesting memory
accesses (loads, stores, atomics), with a simple load, increment, store of
the associated shadow memory location (computed via a mask and shift to do
the mapping of 64 bytes to 8 byte shadow, and add of the shadow offset).
The handling is very similar to and based off of the ASAN instrumentation
pass, with slightly different instrumentation code.

Various techniques can be used to reduce the overhead, by aggressively
coalescing counter updates (e.g. given the 32-byte alignment, accesses
known to be in the same 32-byte block, or across possible aliases since we
don’t care about the dereferenced values).

Additionally, the Clang driver needs to set up to link with the runtime
library, much as it does with the sanitizers.

A -fheapprof option is added to enable the instrumentation pass and runtime
library linking. Similar to -fprofile-generate, -fheapprof will accept an
argument specifying the directory in which to write the profile.
Runtime

The heap profiler runtime is responsible for tracking and reporting
information about heap allocations and accesses, aggregated by allocation
calling context. For example, the hotness, lifetime, and cpu affinity.

A new heapprof library will be created within compiler-rt. It will leverage
support within sanitizer_common, which already contains facilities like
stack context tracking, needed by the heap profiler.
Shadow Memory

There are some basic facilities in sanitizer_common for mmap’ing the shadow
memory, but most of the existing setup lives in the ASAN and HWASAN
libraries. In the case of ASAN, there is support for both statically
assigned shadow offsets (the default on most platforms), and for
dynamically assigned shadow memory (implemented for Windows and currently
also used for Android and iOS). According to kcc, recent experiments show
that the performance with a dynamic shadow is close to that with a static
mapping. In fact, that is the only approach currently used by HWASAN. Given
the simplicity, the heap profiler will be implemented with a dynamic shadow
as well.

There are a number of functions in ASAN and HWASAN related to setup of the
shadow that are duplicated but very nearly identical, at least for linux
(which seems to be the only OS flavor currently supported for HWASAN). E.g.
ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN
there is another nearly identical copy in PremapShadow, used by Android,
whereas in HW ASAN the premap handling is already commoned with the
non-premap handling). Rather than make yet another copy of these
mechanisms, I propose refactoring them into sanitizer_common versions. Like
HWASAN, the initial version of the heap profiler will be supported for
linux only, but other OSes can be added as needed similar to ASAN.
StackTrace and StackDepot

The sanitizer already contains support for obtaining and representing a
stack trace in a StackTrace object, and storing it in the StackDepot which
“efficiently stores huge amounts of stack traces”. This is in the
sanitizer_common subdirectory and the support is shared by ASAN and
ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
where each StackTrace is assigned a unique id. ASAN stores this id in the
alloc_context_id field in each ChunkHeader (in the redzone preceding each
allocation). Additionally, there is support for symbolizing and printing
StackTrace objects.
ChunkHeader

The heap profiler needs to track several pieces of information for each
allocation. Given the mapping of 64-bytes to 8-bytes shadow, we can achieve
a minimum of 32-byte alignment by holding this information in a 32-byte
header block preceding each allocation.

In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains
information about the current allocation state, user requested size,
allocation and free thread ids, the allocation context id (representing the
call stack at allocation, assigned by the StackDepot as described above),
and misc other bookkeeping. For heap profiling, this will be converted to a
32-byte header block.

Note that we could instead use the metadata section, similar to other
sanitizers, which is stored in a separate location. However, as described
above, storing the header block with each allocation enables 32-byte
alignment without aliasing shadow counters for the same 64 bytes of memory.

In the prototype heap profiler implementation, the header contains the
following fields:

// Should be 32 bytes

struct ChunkHeader {

  // 1-st 4 bytes

  // Carry over from ASAN (available, allocated, quarantined). Will be

  // reduced to 1 bit (available or allocated).

  u32 chunk_state       : 8;

  // Carry over from ASAN. Used to determine the start of user allocation.

  u32 from_memalign     : 1;

  // 23 bits available

  // 2-nd 4 bytes

  // Carry over from ASAN (comment copied verbatim).

  // This field is used for small sizes. For large sizes it is equal to

  // SizeClassMap::kMaxSize and the actual size is stored in the

  // SecondaryAllocator's metadata.

  u32 user_requested_size : 29;

  // 3-rd 4 bytes

  u32 cpu_id; // Allocation cpu id

  // 4-th 4 bytes

  // Allocation timestamp in ms from a baseline timestamp computed at

  // the start of profiling (to keep this within 32 bits).

  u32 timestamp_ms;

  // 5-th and 6-th 4 bytes

  // Carry over from ASAN. Used to identify allocation stack trace.

  u64 alloc_context_id;

  // 7-th and 8-th 4 bytes

  // UNIMPLEMENTED in prototype - needs instrumentation and IR support.

  u64 data_type_id; // hash of type name

};

As noted, the chunk state can be reduced to a single bit (no need for
quarantined memory in the heap profiler). The header contains a placeholder
for the data type hash, which is not yet implemented as it needs
instrumentation and IR support.
Heap Info Block (HIB)

On a deallocation, information from the corresponding shadow block(s) and
header are recorded in a Heap Info Block (HIB) object. The access count is
computed from the shadow memory locations for the allocation, as well as
the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero
8-byte shadow locations for the whole allocation). Other information such
as the deallocation timestamp (for lifetime computation) and deallocation
cpu id (to determine migrations) are recorded along with the information in
the chunk header recorded on allocation.

The prototyped HIB object tracks the following:

struct HeapInfoBlock {

  // Total allocations at this stack context

  u32 alloc_count;

  // Access count computed from all allocated 64-byte blocks (track total

  // across all allocations, and the min and max).

  u64 total_access_count, min_access_count, max_access_count;

  // Allocated size (track total across all allocations, and the min and
max).

  u64 total_size;

  u32 min_size, max_size;

  // Lifetime (track total across all allocations, and the min and max).

  u64 total_lifetime;

  u32 min_lifetime, max_lifetime;

  // Percent utilization of allocated 64-byte blocks (track total

  // across all allocations, and the min and max). The utilization is

  // defined as the percentage of 8-byte shadow counters corresponding to

  // the full allocation that are non-zero.

  u64 total_percent_utilized;

  u32 min_percent_utilized, max_percent_utilized;

  // Allocation and deallocation timestamps from the most recent merge into

  // the table with this stack context.

  u32 alloc_timestamp, dealloc_timestamp;

  // Allocation and deallocation cpu ids from the most recent merge into

  // the table with this stack context.

  u32 alloc_cpu_id, dealloc_cpu_id;

  // Count of allocations at this stack context that had a different

  // allocation and deallocation cpu id.

  u32 num_migrated_cpu;

  // Number of times the lifetime of the entry being merged had its lifetime

  // overlap with the previous entry merged with this stack context (by

  // comparing the new alloc/dealloc timestamp with the one last recorded in

  // the entry in the table.

  u32 num_lifetime_overlaps;

  // Number of times the alloc/dealloc cpu of the entry being merged was the

  // same as that of the previous entry merged with this stack context

  u32 num_same_alloc_cpu;

  u32 num_same_dealloc_cpu;

  // Hash of type name (UNIMPLEMENTED). This needs instrumentation support
and

  // possibly IR changes.

  u64 data_type_id;

}
HIB Table

The Heap Info Block Table, which is a multi-way associative cache, holds
HIB objects from deallocated objects. It is indexed by the stack allocation
context id from the chunk header, and currently utilizes a simple mod with
a prime number close to a power of two as the hash (because of the way the
stack context ids are assigned, a mod of a power of two performs very
poorly). Thus far, only 4-way associativity has been evaluated.

HIB entries are added or merged into the HIB Table on each deallocation. If
an entry with a matching stack alloc context id is found in the Table, the
newly deallocated information is merged into the existing entry. Each HIB
Table entry currently tracks the min, max and total value of the various
fields for use in computing and reporting the min, max and average when the
Table is ultimately dumped.

If no entry with a matching stack alloc context id is found, a new entry is
created. If this causes an eviction, the evicted entry is dumped
immediately (by default to stderr, otherwise to a specified report file).
Later post processing can merge dumped entries with the same stack alloc
context id.
Initialization

For ASAN, an __asan_init function initializes the memory allocation
tracking support, and the ASAN instrumentation pass in LLVM creates a
global constructor to invoke it. The heap profiler prototype adds a new
__heapprof_init function, which performs heap profile specific
initialization, and the heap profile instrumentation pass calls this new
init function instead by a generated global constructor. It currently
additionally invokes __asan_init since we are leveraging a modified ASAN
runtime. Eventually, this should be changed to initialize refactored common
support.

Note that __asan init is also placed in the .preinit_array when it is
available, so it is invoked even earlier than global constructors.
Currently, it is not possible to do this for __heapprof_init, as it calls
timespec_get in order to get a baseline timestamp (as described in the
ChunkHeader comments the timestamps (ms) are actually offsets from the
baseline timestamp, in order to fit into 32 bits), and system calls cannot
be made that early (dl_init is not complete). Since the constructor
priority is 1, it should be executed early enough that there are very few
allocations before it runs, and likely the best solution is to simply
ignore any allocations before initialization.
Dumping

For the prototype, the profile is dumped as text with a compact raw format
to limit its size. Ultimately it should be dumped in a more compact binary
format (i.e. into a different section of the raw instrumentation based
profile, with llvm-profdata performing post-processing) which is TBD.
HIB Dumping

As noted earlier, HIB Table entries are created as memory is deallocated.
At the end of the run (or whenever dumping is requested, discussed later),
HIB entries need to be created for allocations that are still live.
Conveniently, the sanitizer allocator already contains a mechanism to walk
through all chunks of memory it is tracking (ForEachChunk). The heap
profiler simply looks for all chunks with a chunk state of allocated, and
creates a HIB the same as would be done on deallocation, adding each to the
table.

A HIB Table mechanism for printing each entry is then invoked.

By default, the dumping occurs:

   -

   on evictions
   -

   full table at exit (when the static Allocator object is destructed)


For running in a load testing scenario, we will want to add a mechanism to
provoke finalization (merging currently live allocations) and dumping of
the HIB Table before exit. This would be similar to the __llvm_profile_dump
facility used for normal PGO counter dumping.
Stack Trace Dumping

There is existing support for dumping symbolized StackTrace objects. A
wrapper to dump all StackTrace objects in the StackDepot will be added.
This new interface is invoked just after the HIB Table is dumped (on exit
or via dumping interface).
Memory Map Dumping

In cases where we may want to symbolize as a post processing step, we may
need the memory map (from /proc/self/smaps). Specifically, this is needed
to symbolize binaries using ASLR (Address Space Layout Randomization).
There is already support for reading this file and dumping it to the
specified report output file (DumpProcessMap()). This is invoked when the
profile output file is initialized (HIB Table construction), so that the
memory map is available at the top of the raw profile.
Current Status and Next Steps

As mentioned earlier, I have a working prototype based on a simplified
stripped down version of ASAN. My current plan is to do the following:

   1.

   Refactor out some of the shadow setup code common between ASAN and
   HWASAN into sanitizer_common.
   2.

   Rework my prototype into a separate heapprof library in compiler-rt,
   using sanitizer_common support where possible, and send patches for review.
   3.

   Send patches for the heap profiler instrumentation pass and related
   clang options.
   4.

   Design/implement binary profile format


-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200624/de55b6e6/attachment.html>

Hal Finkel via llvm-dev

2020-Jun-25 01:26 UTC

head link

[llvm-dev] RFC: Sanitizer-based Heap Profiler

Hi, Teresa,

This looks really useful.

It seems like this general infrastructure could be useful for diagnosing 
places where we have a lot of false sharing too. This could be between 
cores, sockets, or devices. Looking forward, especially with HMM and 
support under Linux for transparent unified memory between CPUs and 
accelerators, I anticipate we'll end up looking for places where some 
data accesses are causing undesired page migrations. Between cores and 
sockets, the interesting granularity is probably a cache line (and it 
looks like this is what you have anyway). This seems similar to me, the 
only difference being that we need to increment the count, not on every 
access, but only when the relevant entity id is different from the last 
one that accessed the relevant memory region. I'm not saying that we 
need to support this, but I'm curious what you think about this 
potential use case.

  -Hal

On 6/24/20 6:57 PM, Teresa Johnson via llvm-dev wrote:> Hi all,
>
> I've included an RFC for a heap profiler design I've been working
on
> in conjunction with David Li. Please send any questions or feedback. 
> For sanitizer folks, one area of feedback is on refactoring some of 
> the *ASAN shadow setup code (see the Shadow Memory section).
>
> Thanks,
> Teresa
>
> RFC: Sanitizer-based Heap Profiler
>
>
>   Summary
>
> This document provides an overview of an LLVM Sanitizer-based heap 
> profiler design.
>
>
>   Motivation
>
> The objective of heap memory profiling is to collect critical runtime 
> information associated with heap memory references and information on 
> heap memory allocations. The profile information will be used first 
> for tooling, and subsequently to guide the compiler optimizer and 
> allocation runtime to layout heap objects with improved spatial 
> locality. As a  result, DTLB and cache utilization will be improved, 
> and program IPC (performance) will be increased due to reduced TLB and 
> cache misses. More details on the heap profile guided optimizations 
> will be shared in the future.
>
>
>   Overview
>
> The profiler is based on compiler inserted instrumentation of load and 
> store accesses, and utilizes runtime support to monitor heap 
> allocations and profile data. The target consumer of the heap memory 
> profile information is initially tooling and ultimately automatic data 
> layout optimizations performed by the compiler and/or allocation 
> runtime (with the support of new allocation runtime APIs).
>
>
> Each memory address is mapped to Shadow Memory 
> <https://en.wikipedia.org/wiki/Shadow_memory>, similar to the
approach
> used by the Address Sanitizer 
> <https://github.com/google/sanitizers/wiki/AddressSanitizer>(ASAN). 
> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, 
> the heap profiler maps 64 bytes of memory to 8 bytes of shadow. The 
> shadow location implements the profile counter (incremented on 
> accesses to the corresponding memory). This granularity was chosen to 
> help avoid counter overflow, but it may be possible to consider 
> mapping 32-bytes to 4 bytes. To avoid aliasing of shadow memory for 
> different allocations, we must choose a minimum alignment carefully. 
> As discussed further below, we can attain a 32-byte minimum alignment, 
> instead of a 64-byte alignment, by storing necessary heap information 
> for each allocation in a 32-byte header block.
>
>
> The compiler instruments each load and store to increment the 
> associated shadow memory counter, in order to determine hotness.
>
>
> The heap profiler runtime is responsible for tracking allocations and 
> deallocations, including the stack at each allocation, and information 
> such as the allocation size and other statistics. I have implemented a 
> prototype built using a stripped down and modified version of ASAN, 
> however this will be a separate library utilizing sanitizer_common 
> components.
>
>
>     Compiler
>
> A simple HeapProfiler instrumentation pass instruments interesting 
> memory accesses (loads, stores, atomics), with a simple load, 
> increment, store of the associated shadow memory location (computed 
> via a mask and shift to do the mapping of 64 bytes to 8 byte shadow, 
> and add of the shadow offset). The handling is very similar to and 
> based off of the ASAN instrumentation pass, with slightly different 
> instrumentation code.
>
>
> Various techniques can be used to reduce the overhead, by aggressively 
> coalescing counter updates (e.g. given the 32-byte alignment, accesses 
> known to be in the same 32-byte block, or across possible aliases 
> since we don’t care about the dereferenced values).
>
>
> Additionally, the Clang driver needs to set up to link with the 
> runtime library, much as it does with the sanitizers.
>
>
> A -fheapprofoption is added to enable the instrumentation pass and 
> runtime library linking. Similar to -fprofile-generate, -fheapprofwill 
> accept an argument specifying the directory in which to write the profile.
>
>
>     Runtime
>
> The heap profiler runtime is responsible for tracking and reporting 
> information about heap allocations and accesses, aggregated by 
> allocation calling context. For example, the hotness, lifetime, and 
> cpu affinity.
>
>
> A new heapprof library will be created within compiler-rt. It will 
> leverage support within sanitizer_common, which already contains 
> facilities like stack context tracking, needed by the heap profiler.
>
>
>       Shadow Memory
>
> There are some basic facilities in sanitizer_common for mmap’ing the 
> shadow memory, but most of the existing setup lives in the ASAN and 
> HWASAN libraries. In the case of ASAN, there is support for both 
> statically assigned shadow offsets (the default on most platforms), 
> and for dynamically assigned shadow memory (implemented for Windows 
> and currently also used for Android and iOS). According to kcc, recent 
> experiments show that the performance with a dynamic shadow is close 
> to that with a static mapping. In fact, that is the only approach 
> currently used by HWASAN. Given the simplicity, the heap profiler will 
> be implemented with a dynamic shadow as well.
>
>
> There are a number of functions in ASAN and HWASAN related to setup of 
> the shadow that are duplicated but very nearly identical, at least for 
> linux (which seems to be the only OS flavor currently supported for 
> HWASAN). E.g. ReserveShadowMemoryRange, ProtectGap, and 
> FindDynamicShadowStart (in ASAN there is another nearly identical copy 
> in PremapShadow, used by Android, whereas in HW ASAN the premap 
> handling is already commoned with the non-premap handling). Rather 
> than make yet another copy of these mechanisms, I propose refactoring 
> them into sanitizer_common versions. Like HWASAN, the initial version 
> of the heap profiler will be supported for linux only, but other OSes 
> can be added as needed similar to ASAN.
>
>
>       StackTrace and StackDepot
>
> The sanitizer already contains support for obtaining and representing 
> a stack trace in a StackTrace object, and storing it in the StackDepot 
> which “efficiently stores huge amounts of stack traces”. This is in 
> the sanitizer_common subdirectory and the support is shared by ASAN 
> and ThreadSanitizer. The StackDepot is essentially an unbounded hash 
> table, where each StackTrace is assigned a unique id. ASAN stores this 
> id in the alloc_context_id field in each ChunkHeader (in the redzone 
> preceding each allocation). Additionally, there is support for 
> symbolizing and printing StackTrace objects.
>
>
>       ChunkHeader
>
> The heap profiler needs to track several pieces of information for 
> each allocation. Given the mapping of 64-bytes to 8-bytes shadow, we 
> can achieve a minimum of 32-byte alignment by holding this information 
> in a 32-byte header block preceding each allocation.
>
>
> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It 
> contains information about the current allocation state, user 
> requested size, allocation and free thread ids, the allocation context 
> id (representing the call stack at allocation, assigned by the 
> StackDepot as described above), and misc other bookkeeping. For heap 
> profiling, this will be converted to a 32-byte header block.
>
>
> Note that we could instead use the metadata section, similar to other 
> sanitizers, which is stored in a separate location. However, as 
> described above, storing the header block with each allocation enables 
> 32-byte alignment without aliasing shadow counters for the same 64 
> bytes of memory.
>
>
> In the prototype heap profiler implementation, the header contains the 
> following fields:
>
>
> // Should be 32 bytes
>
> struct ChunkHeader {
>
>   // 1-st 4 bytes
>
>   // Carry over from ASAN (available, allocated, quarantined). Will be
>
>   // reduced to 1 bit (available or allocated).
>
>   u32 chunk_state       : 8;
>
>   // Carry over from ASAN. Used to determine the start of user allocation.
>
>   u32 from_memalign     : 1;
>
>   // 23 bits available
>
>
>   // 2-nd 4 bytes
>
>   // Carry over from ASAN (comment copied verbatim).
>
>   // This field is used for small sizes. For large sizes it is equal to
>
>   // SizeClassMap::kMaxSize and the actual size is stored in the
>
>   // SecondaryAllocator's metadata.
>
>   u32 user_requested_size : 29;
>
>
>   // 3-rd 4 bytes
>
>   u32 cpu_id; // Allocation cpu id
>
>
>   // 4-th 4 bytes
>
>   // Allocation timestamp in ms from a baseline timestamp computed at
>
>   // the start of profiling (to keep this within 32 bits).
>
>   u32 timestamp_ms;
>
>
>   // 5-th and 6-th 4 bytes
>
>   // Carry over from ASAN. Used to identify allocation stack trace.
>
>   u64 alloc_context_id;
>
>
>   // 7-th and 8-th 4 bytes
>
>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>
>   u64 data_type_id; // hash of type name
>
> };
>
> As noted, the chunk state can be reduced to a single bit (no need for 
> quarantined memory in the heap profiler). The header contains a 
> placeholder for the data type hash, which is not yet implemented as it 
> needs instrumentation and IR support.
>
>
>       Heap Info Block (HIB)
>
> On a deallocation, information from the corresponding shadow block(s) 
> and header are recorded in a Heap Info Block (HIB) object. The access 
> count is computed from the shadow memory locations for the allocation, 
> as well as the percentage of accessed 64-byte blocks (i.e. the 
> percentage of non-zero 8-byte shadow locations for the whole 
> allocation). Other information such as the deallocation timestamp (for 
> lifetime computation) and deallocation cpu id (to determine 
> migrations) are recorded along with the information in the chunk 
> header recorded on allocation.
>
>
> The prototyped HIB object tracks the following:
>
>
> struct HeapInfoBlock {
>
>   // Total allocations at this stack context
>
>   u32 alloc_count;
>
>   // Access count computed from all allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max).
>
>   u64 total_access_count, min_access_count, max_access_count;
>
>   // Allocated size (track total across all allocations, and the min 
> and max).
>
>   u64 total_size;
>
>   u32 min_size, max_size;
>
>   // Lifetime (track total across all allocations, and the min and max).
>
>   u64 total_lifetime;
>
>   u32 min_lifetime, max_lifetime;
>
>   // Percent utilization of allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max). The utilization is
>
>   // defined as the percentage of 8-byte shadow counters corresponding to
>
>   // the full allocation that are non-zero.
>
>   u64 total_percent_utilized;
>
>   u32 min_percent_utilized, max_percent_utilized;
>
>   // Allocation and deallocation timestamps from the most recent merge 
> into
>
>   // the table with this stack context.
>
>   u32 alloc_timestamp, dealloc_timestamp;
>
>   // Allocation and deallocation cpu ids from the most recent merge into
>
>   // the table with this stack context.
>
>   u32 alloc_cpu_id, dealloc_cpu_id;
>
>   // Count of allocations at this stack context that had a different
>
>   // allocation and deallocation cpu id.
>
>   u32 num_migrated_cpu;
>
>   // Number of times the lifetime of the entry being merged had its 
> lifetime
>
>   // overlap with the previous entry merged with this stack context (by
>
>   // comparing the new alloc/dealloc timestamp with the one last 
> recorded in
>
>   // the entry in the table.
>
>   u32 num_lifetime_overlaps;
>
>   // Number of times the alloc/dealloc cpu of the entry being merged 
> was the
>
>   // same as that of the previous entry merged with this stack context
>
>   u32 num_same_alloc_cpu;
>
>   u32 num_same_dealloc_cpu;
>
>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation 
> support and
>
>   // possibly IR changes.
>
>   u64 data_type_id;
>
> }
>
>
>       HIB Table
>
> The Heap Info Block Table, which is a multi-way associative cache, 
> holds HIB objects from deallocated objects. It is indexed by the stack 
> allocation context id from the chunk header, and currently utilizes a 
> simple mod with a prime number close to a power of two as the hash 
> (because of the way the stack context ids are assigned, a mod of a 
> power of two performs very poorly). Thus far, only 4-way associativity 
> has been evaluated.
>
>
> HIB entries are added or merged into the HIB Table on each 
> deallocation. If an entry with a matching stack alloc context id is 
> found in the Table, the newly deallocated information is merged into 
> the existing entry. Each HIB Table entry currently tracks the min, max 
> and total value of the various fields for use in computing and 
> reporting the min, max and average when the Table is ultimately dumped.
>
>
> If no entry with a matching stack alloc context id is found, a new 
> entry is created. If this causes an eviction, the evicted entry is 
> dumped immediately (by default to stderr, otherwise to a specified 
> report file). Later post processing can merge dumped entries with the 
> same stack alloc context id.
>
>
>       Initialization
>
>
> For ASAN, an __asan_init function initializes the memory allocation 
> tracking support, and the ASAN instrumentation pass in LLVM creates a 
> global constructor to invoke it. The heap profiler prototype adds a 
> new __heapprof_init function, which performs heap profile specific 
> initialization, and the heap profile instrumentation pass calls this 
> new init function instead by a generated global constructor. It 
> currently additionally invokes __asan_init since we are leveraging a 
> modified ASAN runtime. Eventually, this should be changed to 
> initialize refactored common support.
>
>
> Note that __asan init is also placed in the .preinit_array when it is 
> available, so it is invoked even earlier than global constructors. 
> Currently, it is not possible to do this for __heapprof_init, as it 
> calls timespec_get in order to get a baseline timestamp (as described 
> in the ChunkHeader comments the timestamps (ms) are actually offsets 
> from the baseline timestamp, in order to fit into 32 bits), and system 
> calls cannot be made that early (dl_init is not complete). Since the 
> constructor priority is 1, it should be executed early enough that 
> there are very few allocations before it runs, and likely the best 
> solution is to simply ignore any allocations before initialization.
>
>
>       Dumping
>
> For the prototype, the profile is dumped as text with a compact raw 
> format to limit its size. Ultimately it should be dumped in a more 
> compact binary format (i.e. into a different section of the raw 
> instrumentation based profile, with llvm-profdata performing 
> post-processing) which is TBD.
>
>
>         HIB Dumping
>
> As noted earlier, HIB Table entries are created as memory is 
> deallocated. At the end of the run (or whenever dumping is requested, 
> discussed later), HIB entries need to be created for allocations that 
> are still live. Conveniently, the sanitizer allocator already contains 
> a mechanism to walk through all chunks of memory it is tracking 
> (ForEachChunk). The heap profiler simply looks for all chunks with a 
> chunk state of allocated, and creates a HIB the same as would be done 
> on deallocation, adding each to the table.
>
>
> A HIB Table mechanism for printing each entry is then invoked.
>
>
> By default, the dumping occurs:
>
>  *
>
>     on evictions
>
>  *
>
>     full table at exit (when the static Allocator object is destructed)
>
>
> For running in a load testing scenario, we will want to add a 
> mechanism to provoke finalization (merging currently live allocations) 
> and dumping of the HIB Table before exit. This would be similar to the 
> __llvm_profile_dump facility used for normal PGO counter dumping.
>
>
>         Stack Trace Dumping
>
> There is existing support for dumping symbolized StackTrace objects. A 
> wrapper to dump all StackTrace objects in the StackDepot will be 
> added. This new interface is invoked just after the HIB Table is 
> dumped (on exit or via dumping interface).
>
>
>         Memory Map Dumping
>
> In cases where we may want to symbolize as a post processing step, we 
> may need the memory map (from /proc/self/smaps). Specifically, this is 
> needed to symbolize binaries using ASLR (Address Space Layout 
> Randomization). There is already support for reading this file and 
> dumping it to the specified report output file (DumpProcessMap()). 
> This is invoked when the profile output file is initialized (HIB Table 
> construction), so that the memory map is available at the top of the 
> raw profile.
>
>
>   Current Status and Next Steps
>
>
> As mentioned earlier, I have a working prototype based on a simplified 
> stripped down version of ASAN. My current plan is to do the following:
>
> 1.
>
>     Refactor out some of the shadow setup code common between ASAN and
>     HWASAN into sanitizer_common.
>
> 2.
>
>     Rework my prototype into a separate heapprof library in
>     compiler-rt, using sanitizer_common support where possible, and
>     send patches for review.
>
> 3.
>
>     Send patches for the heap profiler instrumentation pass and
>     related clang options.
>
> 4.
>
>     Design/implement binary profile format
>
>
> -- 
> Teresa Johnson | 	 Software Engineer | 	tejohnson at google.com 
> <mailto:tejohnson at google.com> | 	
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200624/1dab1801/attachment-0001.html>

Xinliang David Li via llvm-dev

2020-Jun-25 02:33 UTC

head link

[llvm-dev] RFC: Sanitizer-based Heap Profiler

On Wed, Jun 24, 2020 at 6:26 PM Hal Finkel <hfinkel at anl.gov> wrote:
> Hi, Teresa,
>
> This looks really useful.
>
> It seems like this general infrastructure could be useful for diagnosing
> places where we have a lot of false sharing too.
>
This is mostly already covered in the current proposal -- except that
tracking granularity can not tell the difference between true vs false
sharing. The rich contexts provided, however, will enable users to quickly
identify if there is  a real issue.
> This could be between cores, sockets, or devices. Looking forward,
> especially with HMM and support under Linux for transparent unified memory
> between CPUs and accelerators, I anticipate we'll end up looking for
places
> where some data accesses are causing undesired page migrations. Between
> cores and sockets, the interesting granularity is probably a cache line
> (and it looks like this is what you have anyway). This seems similar to me,
> the only difference being that we need to increment the count, not on every
> access, but only when the relevant entity id is different from the last one
> that accessed the relevant memory region.
>
This can be controlled under an instrumentation time option, I think.

David



> I'm not saying that we need to support this, but I'm curious what
you
> think about this potential use case.
>
>  -Hal
> On 6/24/20 6:57 PM, Teresa Johnson via llvm-dev wrote:
>
> Hi all,
>
> I've included an RFC for a heap profiler design I've been working
on in
> conjunction with David Li. Please send any questions or feedback. For
> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
> shadow setup code (see the Shadow Memory section).
>
> Thanks,
> Teresa
>
>
> RFC: Sanitizer-based Heap Profiler
> Summary
>
> This document provides an overview of an LLVM Sanitizer-based heap
> profiler design.
> Motivation
>
> The objective of heap memory profiling is to collect critical runtime
> information associated with heap memory references and information on heap
> memory allocations. The profile information will be used first for tooling,
> and subsequently to guide the compiler optimizer and allocation runtime to
> layout heap objects with improved spatial locality. As a  result, DTLB and
> cache utilization will be improved, and program IPC (performance) will be
> increased due to reduced TLB and cache misses. More details on the heap
> profile guided optimizations will be shared in the future.
> Overview
>
> The profiler is based on compiler inserted instrumentation of load and
> store accesses, and utilizes runtime support to monitor heap allocations
> and profile data. The target consumer of the heap memory profile
> information is initially tooling and ultimately automatic data layout
> optimizations performed by the compiler and/or allocation runtime (with the
> support of new allocation runtime APIs).
>
> Each memory address is mapped to Shadow Memory
> <https://en.wikipedia.org/wiki/Shadow_memory>, similar to the
approach
> used by the Address Sanitizer
> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
> location implements the profile counter (incremented on accesses to the
> corresponding memory). This granularity was chosen to help avoid counter
> overflow, but it may be possible to consider mapping 32-bytes to 4 bytes.
> To avoid aliasing of shadow memory for different allocations, we must
> choose a minimum alignment carefully. As discussed further below, we can
> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
> storing necessary heap information for each allocation in a 32-byte header
> block.
>
> The compiler instruments each load and store to increment the associated
> shadow memory counter, in order to determine hotness.
>
> The heap profiler runtime is responsible for tracking allocations and
> deallocations, including the stack at each allocation, and information such
> as the allocation size and other statistics. I have implemented a prototype
> built using a stripped down and modified version of ASAN, however this will
> be a separate library utilizing sanitizer_common components.
> Compiler
>
> A simple HeapProfiler instrumentation pass instruments interesting memory
> accesses (loads, stores, atomics), with a simple load, increment, store of
> the associated shadow memory location (computed via a mask and shift to do
> the mapping of 64 bytes to 8 byte shadow, and add of the shadow offset).
> The handling is very similar to and based off of the ASAN instrumentation
> pass, with slightly different instrumentation code.
>
> Various techniques can be used to reduce the overhead, by aggressively
> coalescing counter updates (e.g. given the 32-byte alignment, accesses
> known to be in the same 32-byte block, or across possible aliases since we
> don’t care about the dereferenced values).
>
> Additionally, the Clang driver needs to set up to link with the runtime
> library, much as it does with the sanitizers.
>
> A -fheapprof option is added to enable the instrumentation pass and
> runtime library linking. Similar to -fprofile-generate, -fheapprof will
> accept an argument specifying the directory in which to write the profile.
> Runtime
>
> The heap profiler runtime is responsible for tracking and reporting
> information about heap allocations and accesses, aggregated by allocation
> calling context. For example, the hotness, lifetime, and cpu affinity.
>
> A new heapprof library will be created within compiler-rt. It will
> leverage support within sanitizer_common, which already contains facilities
> like stack context tracking, needed by the heap profiler.
> Shadow Memory
>
> There are some basic facilities in sanitizer_common for mmap’ing the
> shadow memory, but most of the existing setup lives in the ASAN and HWASAN
> libraries. In the case of ASAN, there is support for both statically
> assigned shadow offsets (the default on most platforms), and for
> dynamically assigned shadow memory (implemented for Windows and currently
> also used for Android and iOS). According to kcc, recent experiments show
> that the performance with a dynamic shadow is close to that with a static
> mapping. In fact, that is the only approach currently used by HWASAN. Given
> the simplicity, the heap profiler will be implemented with a dynamic shadow
> as well.
>
> There are a number of functions in ASAN and HWASAN related to setup of the
> shadow that are duplicated but very nearly identical, at least for linux
> (which seems to be the only OS flavor currently supported for HWASAN). E.g.
> ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN
> there is another nearly identical copy in PremapShadow, used by Android,
> whereas in HW ASAN the premap handling is already commoned with the
> non-premap handling). Rather than make yet another copy of these
> mechanisms, I propose refactoring them into sanitizer_common versions. Like
> HWASAN, the initial version of the heap profiler will be supported for
> linux only, but other OSes can be added as needed similar to ASAN.
> StackTrace and StackDepot
>
> The sanitizer already contains support for obtaining and representing a
> stack trace in a StackTrace object, and storing it in the StackDepot which
> “efficiently stores huge amounts of stack traces”. This is in the
> sanitizer_common subdirectory and the support is shared by ASAN and
> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
> where each StackTrace is assigned a unique id. ASAN stores this id in the
> alloc_context_id field in each ChunkHeader (in the redzone preceding each
> allocation). Additionally, there is support for symbolizing and printing
> StackTrace objects.
> ChunkHeader
>
> The heap profiler needs to track several pieces of information for each
> allocation. Given the mapping of 64-bytes to 8-bytes shadow, we can achieve
> a minimum of 32-byte alignment by holding this information in a 32-byte
> header block preceding each allocation.
>
> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains
> information about the current allocation state, user requested size,
> allocation and free thread ids, the allocation context id (representing the
> call stack at allocation, assigned by the StackDepot as described above),
> and misc other bookkeeping. For heap profiling, this will be converted to a
> 32-byte header block.
>
> Note that we could instead use the metadata section, similar to other
> sanitizers, which is stored in a separate location. However, as described
> above, storing the header block with each allocation enables 32-byte
> alignment without aliasing shadow counters for the same 64 bytes of memory.
>
> In the prototype heap profiler implementation, the header contains the
> following fields:
>
> // Should be 32 bytes
>
> struct ChunkHeader {
>
>   // 1-st 4 bytes
>
>   // Carry over from ASAN (available, allocated, quarantined). Will be
>
>   // reduced to 1 bit (available or allocated).
>
>   u32 chunk_state       : 8;
>
>   // Carry over from ASAN. Used to determine the start of user allocation.
>
>   u32 from_memalign     : 1;
>
>   // 23 bits available
>
>   // 2-nd 4 bytes
>
>   // Carry over from ASAN (comment copied verbatim).
>
>   // This field is used for small sizes. For large sizes it is equal to
>
>   // SizeClassMap::kMaxSize and the actual size is stored in the
>
>   // SecondaryAllocator's metadata.
>
>   u32 user_requested_size : 29;
>
>   // 3-rd 4 bytes
>
>   u32 cpu_id; // Allocation cpu id
>
>   // 4-th 4 bytes
>
>   // Allocation timestamp in ms from a baseline timestamp computed at
>
>   // the start of profiling (to keep this within 32 bits).
>
>   u32 timestamp_ms;
>
>   // 5-th and 6-th 4 bytes
>
>   // Carry over from ASAN. Used to identify allocation stack trace.
>
>   u64 alloc_context_id;
>
>   // 7-th and 8-th 4 bytes
>
>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>
>   u64 data_type_id; // hash of type name
>
> };
>
> As noted, the chunk state can be reduced to a single bit (no need for
> quarantined memory in the heap profiler). The header contains a placeholder
> for the data type hash, which is not yet implemented as it needs
> instrumentation and IR support.
> Heap Info Block (HIB)
>
> On a deallocation, information from the corresponding shadow block(s) and
> header are recorded in a Heap Info Block (HIB) object. The access count is
> computed from the shadow memory locations for the allocation, as well as
> the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero
> 8-byte shadow locations for the whole allocation). Other information such
> as the deallocation timestamp (for lifetime computation) and deallocation
> cpu id (to determine migrations) are recorded along with the information in
> the chunk header recorded on allocation.
>
> The prototyped HIB object tracks the following:
>
> struct HeapInfoBlock {
>
>   // Total allocations at this stack context
>
>   u32 alloc_count;
>
>   // Access count computed from all allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max).
>
>   u64 total_access_count, min_access_count, max_access_count;
>
>   // Allocated size (track total across all allocations, and the min and
> max).
>
>   u64 total_size;
>
>   u32 min_size, max_size;
>
>   // Lifetime (track total across all allocations, and the min and max).
>
>   u64 total_lifetime;
>
>   u32 min_lifetime, max_lifetime;
>
>   // Percent utilization of allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max). The utilization is
>
>   // defined as the percentage of 8-byte shadow counters corresponding to
>
>   // the full allocation that are non-zero.
>
>   u64 total_percent_utilized;
>
>   u32 min_percent_utilized, max_percent_utilized;
>
>   // Allocation and deallocation timestamps from the most recent merge into
>
>   // the table with this stack context.
>
>   u32 alloc_timestamp, dealloc_timestamp;
>
>   // Allocation and deallocation cpu ids from the most recent merge into
>
>   // the table with this stack context.
>
>   u32 alloc_cpu_id, dealloc_cpu_id;
>
>   // Count of allocations at this stack context that had a different
>
>   // allocation and deallocation cpu id.
>
>   u32 num_migrated_cpu;
>
>   // Number of times the lifetime of the entry being merged had its
> lifetime
>
>   // overlap with the previous entry merged with this stack context (by
>
>   // comparing the new alloc/dealloc timestamp with the one last recorded
> in
>
>   // the entry in the table.
>
>   u32 num_lifetime_overlaps;
>
>   // Number of times the alloc/dealloc cpu of the entry being merged was
> the
>
>   // same as that of the previous entry merged with this stack context
>
>   u32 num_same_alloc_cpu;
>
>   u32 num_same_dealloc_cpu;
>
>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation support
> and
>
>   // possibly IR changes.
>
>   u64 data_type_id;
>
> }
> HIB Table
>
> The Heap Info Block Table, which is a multi-way associative cache, holds
> HIB objects from deallocated objects. It is indexed by the stack allocation
> context id from the chunk header, and currently utilizes a simple mod with
> a prime number close to a power of two as the hash (because of the way the
> stack context ids are assigned, a mod of a power of two performs very
> poorly). Thus far, only 4-way associativity has been evaluated.
>
> HIB entries are added or merged into the HIB Table on each deallocation.
> If an entry with a matching stack alloc context id is found in the Table,
> the newly deallocated information is merged into the existing entry. Each
> HIB Table entry currently tracks the min, max and total value of the
> various fields for use in computing and reporting the min, max and average
> when the Table is ultimately dumped.
>
> If no entry with a matching stack alloc context id is found, a new entry
> is created. If this causes an eviction, the evicted entry is dumped
> immediately (by default to stderr, otherwise to a specified report file).
> Later post processing can merge dumped entries with the same stack alloc
> context id.
> Initialization
>
> For ASAN, an __asan_init function initializes the memory allocation
> tracking support, and the ASAN instrumentation pass in LLVM creates a
> global constructor to invoke it. The heap profiler prototype adds a new
> __heapprof_init function, which performs heap profile specific
> initialization, and the heap profile instrumentation pass calls this new
> init function instead by a generated global constructor. It currently
> additionally invokes __asan_init since we are leveraging a modified ASAN
> runtime. Eventually, this should be changed to initialize refactored common
> support.
>
> Note that __asan init is also placed in the .preinit_array when it is
> available, so it is invoked even earlier than global constructors.
> Currently, it is not possible to do this for __heapprof_init, as it calls
> timespec_get in order to get a baseline timestamp (as described in the
> ChunkHeader comments the timestamps (ms) are actually offsets from the
> baseline timestamp, in order to fit into 32 bits), and system calls cannot
> be made that early (dl_init is not complete). Since the constructor
> priority is 1, it should be executed early enough that there are very few
> allocations before it runs, and likely the best solution is to simply
> ignore any allocations before initialization.
> Dumping
>
> For the prototype, the profile is dumped as text with a compact raw format
> to limit its size. Ultimately it should be dumped in a more compact binary
> format (i.e. into a different section of the raw instrumentation based
> profile, with llvm-profdata performing post-processing) which is TBD.
> HIB Dumping
>
> As noted earlier, HIB Table entries are created as memory is deallocated.
> At the end of the run (or whenever dumping is requested, discussed later),
> HIB entries need to be created for allocations that are still live.
> Conveniently, the sanitizer allocator already contains a mechanism to walk
> through all chunks of memory it is tracking (ForEachChunk). The heap
> profiler simply looks for all chunks with a chunk state of allocated, and
> creates a HIB the same as would be done on deallocation, adding each to the
> table.
>
> A HIB Table mechanism for printing each entry is then invoked.
>
> By default, the dumping occurs:
>
>    -
>
>    on evictions
>    -
>
>    full table at exit (when the static Allocator object is destructed)
>
>
> For running in a load testing scenario, we will want to add a mechanism to
> provoke finalization (merging currently live allocations) and dumping of
> the HIB Table before exit. This would be similar to the __llvm_profile_dump
> facility used for normal PGO counter dumping.
> Stack Trace Dumping
>
> There is existing support for dumping symbolized StackTrace objects. A
> wrapper to dump all StackTrace objects in the StackDepot will be added.
> This new interface is invoked just after the HIB Table is dumped (on exit
> or via dumping interface).
> Memory Map Dumping
>
> In cases where we may want to symbolize as a post processing step, we may
> need the memory map (from /proc/self/smaps). Specifically, this is needed
> to symbolize binaries using ASLR (Address Space Layout Randomization).
> There is already support for reading this file and dumping it to the
> specified report output file (DumpProcessMap()). This is invoked when the
> profile output file is initialized (HIB Table construction), so that the
> memory map is available at the top of the raw profile.
> Current Status and Next Steps
>
> As mentioned earlier, I have a working prototype based on a simplified
> stripped down version of ASAN. My current plan is to do the following:
>
>    1.
>
>    Refactor out some of the shadow setup code common between ASAN and
>    HWASAN into sanitizer_common.
>    2.
>
>    Rework my prototype into a separate heapprof library in compiler-rt,
>    using sanitizer_common support where possible, and send patches for
review.
>    3.
>
>    Send patches for the heap profiler instrumentation pass and related
>    clang options.
>    4.
>
>    Design/implement binary profile format
>
>
> --
> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>
> _______________________________________________
> LLVM Developers mailing listllvm-dev at
lists.llvm.orghttps://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200624/75538068/attachment-0001.html>

Wenlei He via llvm-dev

2020-Jul-05 06:28 UTC

head link

[llvm-dev] RFC: Sanitizer-based Heap Profiler

This sounds very useful. We’ve improved and used
memoro<https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling
and analysis, and we are also looking for ways to leverage memory profile for
PGO/FDO. I think having a common profiling infrastructure for analysis tooling
as well as profile guided optimizations is good design, and having it in LLVM is
also helpful. Very interested in the tooling and optimization that comes after
the profiler.

Two questions:

  *   How does the profiling overhead look? Is that similar to ASAN overhead
from what you’ve seen, which would be higher than PGO instrumentation? Asking
because I’m wondering if any PGO training setup can be used directly for the new
heap profiling.
  *   I’m not familiar with how sanitizer handles stack trace, but for getting
most accurate calling context (use FP rather than dwarf), I guess frame pointer
omission and tail call opt etc. need to be turned off? Is that going to be
implied by -fheapprof?

Thanks,
Wenlei

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa
Johnson via llvm-dev <llvm-dev at lists.llvm.org>
Reply-To: Teresa Johnson <tejohnson at google.com>
Date: Wednesday, June 24, 2020 at 4:58 PM
To: llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany <kcc at
google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka
<vitalybuka at google.com>
Cc: David Li <davidxl at google.com>
Subject: [llvm-dev] RFC: Sanitizer-based Heap Profiler

Hi all,

I've included an RFC for a heap profiler design I've been working on in
conjunction with David Li. Please send any questions or feedback. For sanitizer
folks, one area of feedback is on refactoring some of the *ASAN shadow setup
code (see the Shadow Memory section).

Thanks,
Teresa


RFC: Sanitizer-based Heap Profiler

Summary

This document provides an overview of an LLVM Sanitizer-based heap profiler
design.

Motivation

The objective of heap memory profiling is to collect critical runtime
information associated with heap memory references and information on heap
memory allocations. The profile information will be used first for tooling, and
subsequently to guide the compiler optimizer and allocation runtime to layout
heap objects with improved spatial locality. As a  result, DTLB and cache
utilization will be improved, and program IPC (performance) will be increased
due to reduced TLB and cache misses. More details on the heap profile guided
optimizations will be shared in the future.

Overview

The profiler is based on compiler inserted instrumentation of load and store
accesses, and utilizes runtime support to monitor heap allocations and profile
data. The target consumer of the heap memory profile information is initially
tooling and ultimately automatic data layout optimizations performed by the
compiler and/or allocation runtime (with the support of new allocation runtime
APIs).


Each memory address is mapped to Shadow
Memory<https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Shadow-5Fmemory&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=KfYo542rDdZQGClmgz-RBw&m=f45oT3WLypO1yblv9KNkPd-rl8jlBp761Hhvev27S8M&s=iIirMZSYnDlGIjY8PZjJprWckHx7QhmKUQKcb1URBFY&e=>,
similar to the approach used by the Address
Sanitizer<https://github.com/google/sanitizers/wiki/AddressSanitizer>
(ASAN). Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow location
implements the profile counter (incremented on accesses to the corresponding
memory). This granularity was chosen to help avoid counter overflow, but it may
be possible to consider mapping 32-bytes to 4 bytes. To avoid aliasing of shadow
memory for different allocations, we must choose a minimum alignment carefully.
As discussed further below, we can attain a 32-byte minimum alignment, instead
of a 64-byte alignment, by storing necessary heap information for each
allocation in a 32-byte header block.


The compiler instruments each load and store to increment the associated shadow
memory counter, in order to determine hotness.


The heap profiler runtime is responsible for tracking allocations and
deallocations, including the stack at each allocation, and information such as
the allocation size and other statistics. I have implemented a prototype built
using a stripped down and modified version of ASAN, however this will be a
separate library utilizing sanitizer_common components.

Compiler

A simple HeapProfiler instrumentation pass instruments interesting memory
accesses (loads, stores, atomics), with a simple load, increment, store of the
associated shadow memory location (computed via a mask and shift to do the
mapping of 64 bytes to 8 byte shadow, and add of the shadow offset). The
handling is very similar to and based off of the ASAN instrumentation pass, with
slightly different instrumentation code.


Various techniques can be used to reduce the overhead, by aggressively
coalescing counter updates (e.g. given the 32-byte alignment, accesses known to
be in the same 32-byte block, or across possible aliases since we don’t care
about the dereferenced values).


Additionally, the Clang driver needs to set up to link with the runtime library,
much as it does with the sanitizers.


A -fheapprof option is added to enable the instrumentation pass and runtime
library linking. Similar to -fprofile-generate, -fheapprof will accept an
argument specifying the directory in which to write the profile.

Runtime

The heap profiler runtime is responsible for tracking and reporting information
about heap allocations and accesses, aggregated by allocation calling context.
For example, the hotness, lifetime, and cpu affinity.


A new heapprof library will be created within compiler-rt. It will leverage
support within sanitizer_common, which already contains facilities like stack
context tracking, needed by the heap profiler.

Shadow Memory

There are some basic facilities in sanitizer_common for mmap’ing the shadow
memory, but most of the existing setup lives in the ASAN and HWASAN libraries.
In the case of ASAN, there is support for both statically assigned shadow
offsets (the default on most platforms), and for dynamically assigned shadow
memory (implemented for Windows and currently also used for Android and iOS).
According to kcc, recent experiments show that the performance with a dynamic
shadow is close to that with a static mapping. In fact, that is the only
approach currently used by HWASAN. Given the simplicity, the heap profiler will
be implemented with a dynamic shadow as well.


There are a number of functions in ASAN and HWASAN related to setup of the
shadow that are duplicated but very nearly identical, at least for linux (which
seems to be the only OS flavor currently supported for HWASAN). E.g.
ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN there
is another nearly identical copy in PremapShadow, used by Android, whereas in HW
ASAN the premap handling is already commoned with the non-premap handling).
Rather than make yet another copy of these mechanisms, I propose refactoring
them into sanitizer_common versions. Like HWASAN, the initial version of the
heap profiler will be supported for linux only, but other OSes can be added as
needed similar to ASAN.

StackTrace and StackDepot

The sanitizer already contains support for obtaining and representing a stack
trace in a StackTrace object, and storing it in the StackDepot which
“efficiently stores huge amounts of stack traces”. This is in the
sanitizer_common subdirectory and the support is shared by ASAN and
ThreadSanitizer. The StackDepot is essentially an unbounded hash table, where
each StackTrace is assigned a unique id. ASAN stores this id in the
alloc_context_id field in each ChunkHeader (in the redzone preceding each
allocation). Additionally, there is support for symbolizing and printing
StackTrace objects.

ChunkHeader

The heap profiler needs to track several pieces of information for each
allocation. Given the mapping of 64-bytes to 8-bytes shadow, we can achieve a
minimum of 32-byte alignment by holding this information in a 32-byte header
block preceding each allocation.


In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains
information about the current allocation state, user requested size, allocation
and free thread ids, the allocation context id (representing the call stack at
allocation, assigned by the StackDepot as described above), and misc other
bookkeeping. For heap profiling, this will be converted to a 32-byte header
block.


Note that we could instead use the metadata section, similar to other
sanitizers, which is stored in a separate location. However, as described above,
storing the header block with each allocation enables 32-byte alignment without
aliasing shadow counters for the same 64 bytes of memory.


In the prototype heap profiler implementation, the header contains the following
fields:


// Should be 32 bytes

struct ChunkHeader {

  // 1-st 4 bytes

  // Carry over from ASAN (available, allocated, quarantined). Will be

  // reduced to 1 bit (available or allocated).

  u32 chunk_state       : 8;

  // Carry over from ASAN. Used to determine the start of user allocation.

  u32 from_memalign     : 1;

  // 23 bits available


  // 2-nd 4 bytes

  // Carry over from ASAN (comment copied verbatim).

  // This field is used for small sizes. For large sizes it is equal to

  // SizeClassMap::kMaxSize and the actual size is stored in the

  // SecondaryAllocator's metadata.

  u32 user_requested_size : 29;


  // 3-rd 4 bytes

  u32 cpu_id; // Allocation cpu id


  // 4-th 4 bytes

  // Allocation timestamp in ms from a baseline timestamp computed at

  // the start of profiling (to keep this within 32 bits).

  u32 timestamp_ms;


  // 5-th and 6-th 4 bytes

  // Carry over from ASAN. Used to identify allocation stack trace.

  u64 alloc_context_id;


  // 7-th and 8-th 4 bytes

  // UNIMPLEMENTED in prototype - needs instrumentation and IR support.

  u64 data_type_id; // hash of type name

};

As noted, the chunk state can be reduced to a single bit (no need for
quarantined memory in the heap profiler). The header contains a placeholder for
the data type hash, which is not yet implemented as it needs instrumentation and
IR support.

Heap Info Block (HIB)

On a deallocation, information from the corresponding shadow block(s) and header
are recorded in a Heap Info Block (HIB) object. The access count is computed
from the shadow memory locations for the allocation, as well as the percentage
of accessed 64-byte blocks (i.e. the percentage of non-zero 8-byte shadow
locations for the whole allocation). Other information such as the deallocation
timestamp (for lifetime computation) and deallocation cpu id (to determine
migrations) are recorded along with the information in the chunk header recorded
on allocation.


The prototyped HIB object tracks the following:


struct HeapInfoBlock {

  // Total allocations at this stack context

  u32 alloc_count;

  // Access count computed from all allocated 64-byte blocks (track total

  // across all allocations, and the min and max).

  u64 total_access_count, min_access_count, max_access_count;

  // Allocated size (track total across all allocations, and the min and max).

  u64 total_size;

  u32 min_size, max_size;

  // Lifetime (track total across all allocations, and the min and max).

  u64 total_lifetime;

  u32 min_lifetime, max_lifetime;

  // Percent utilization of allocated 64-byte blocks (track total

  // across all allocations, and the min and max). The utilization is

  // defined as the percentage of 8-byte shadow counters corresponding to

  // the full allocation that are non-zero.

  u64 total_percent_utilized;

  u32 min_percent_utilized, max_percent_utilized;

  // Allocation and deallocation timestamps from the most recent merge into

  // the table with this stack context.

  u32 alloc_timestamp, dealloc_timestamp;

  // Allocation and deallocation cpu ids from the most recent merge into

  // the table with this stack context.

  u32 alloc_cpu_id, dealloc_cpu_id;

  // Count of allocations at this stack context that had a different

  // allocation and deallocation cpu id.

  u32 num_migrated_cpu;

  // Number of times the lifetime of the entry being merged had its lifetime

  // overlap with the previous entry merged with this stack context (by

  // comparing the new alloc/dealloc timestamp with the one last recorded in

  // the entry in the table.

  u32 num_lifetime_overlaps;

  // Number of times the alloc/dealloc cpu of the entry being merged was the

  // same as that of the previous entry merged with this stack context

  u32 num_same_alloc_cpu;

  u32 num_same_dealloc_cpu;

  // Hash of type name (UNIMPLEMENTED). This needs instrumentation support and

  // possibly IR changes.

  u64 data_type_id;

}

HIB Table

The Heap Info Block Table, which is a multi-way associative cache, holds HIB
objects from deallocated objects. It is indexed by the stack allocation context
id from the chunk header, and currently utilizes a simple mod with a prime
number close to a power of two as the hash (because of the way the stack context
ids are assigned, a mod of a power of two performs very poorly). Thus far, only
4-way associativity has been evaluated.


HIB entries are added or merged into the HIB Table on each deallocation. If an
entry with a matching stack alloc context id is found in the Table, the newly
deallocated information is merged into the existing entry. Each HIB Table entry
currently tracks the min, max and total value of the various fields for use in
computing and reporting the min, max and average when the Table is ultimately
dumped.


If no entry with a matching stack alloc context id is found, a new entry is
created. If this causes an eviction, the evicted entry is dumped immediately (by
default to stderr, otherwise to a specified report file). Later post processing
can merge dumped entries with the same stack alloc context id.

Initialization


For ASAN, an __asan_init function initializes the memory allocation tracking
support, and the ASAN instrumentation pass in LLVM creates a global constructor
to invoke it. The heap profiler prototype adds a new __heapprof_init function,
which performs heap profile specific initialization, and the heap profile
instrumentation pass calls this new init function instead by a generated global
constructor. It currently additionally invokes __asan_init since we are
leveraging a modified ASAN runtime. Eventually, this should be changed to
initialize refactored common support.


Note that __asan init is also placed in the .preinit_array when it is available,
so it is invoked even earlier than global constructors. Currently, it is not
possible to do this for __heapprof_init, as it calls timespec_get in order to
get a baseline timestamp (as described in the ChunkHeader comments the
timestamps (ms) are actually offsets from the baseline timestamp, in order to
fit into 32 bits), and system calls cannot be made that early (dl_init is not
complete). Since the constructor priority is 1, it should be executed early
enough that there are very few allocations before it runs, and likely the best
solution is to simply ignore any allocations before initialization.

Dumping

For the prototype, the profile is dumped as text with a compact raw format to
limit its size. Ultimately it should be dumped in a more compact binary format
(i.e. into a different section of the raw instrumentation based profile, with
llvm-profdata performing post-processing) which is TBD.

HIB Dumping

As noted earlier, HIB Table entries are created as memory is deallocated. At the
end of the run (or whenever dumping is requested, discussed later), HIB entries
need to be created for allocations that are still live. Conveniently, the
sanitizer allocator already contains a mechanism to walk through all chunks of
memory it is tracking (ForEachChunk). The heap profiler simply looks for all
chunks with a chunk state of allocated, and creates a HIB the same as would be
done on deallocation, adding each to the table.


A HIB Table mechanism for printing each entry is then invoked.


By default, the dumping occurs:

  *   on evictions
  *   full table at exit (when the static Allocator object is destructed)


For running in a load testing scenario, we will want to add a mechanism to
provoke finalization (merging currently live allocations) and dumping of the HIB
Table before exit. This would be similar to the __llvm_profile_dump facility
used for normal PGO counter dumping.

Stack Trace Dumping

There is existing support for dumping symbolized StackTrace objects. A wrapper
to dump all StackTrace objects in the StackDepot will be added. This new
interface is invoked just after the HIB Table is dumped (on exit or via dumping
interface).

Memory Map Dumping

In cases where we may want to symbolize as a post processing step, we may need
the memory map (from /proc/self/smaps). Specifically, this is needed to
symbolize binaries using ASLR (Address Space Layout Randomization). There is
already support for reading this file and dumping it to the specified report
output file (DumpProcessMap()). This is invoked when the profile output file is
initialized (HIB Table construction), so that the memory map is available at the
top of the raw profile.

Current Status and Next Steps


As mentioned earlier, I have a working prototype based on a simplified stripped
down version of ASAN. My current plan is to do the following:

  1.  Refactor out some of the shadow setup code common between ASAN and HWASAN
into sanitizer_common.
  2.  Rework my prototype into a separate heapprof library in compiler-rt, using
sanitizer_common support where possible, and send patches for review.
  3.  Send patches for the heap profiler instrumentation pass and related clang
options.
  4.  Design/implement binary profile format

--
Teresa Johnson |
 Software Engineer |
 tejohnson at google.com<mailto:tejohnson at google.com> |

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200705/560ecd9e/attachment-0001.html>

Xinliang David Li via llvm-dev

2020-Jul-05 20:43 UTC

head link

[llvm-dev] RFC: Sanitizer-based Heap Profiler

On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote:
> This sounds very useful. We’ve improved and used memoro
> <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling
and
> analysis, and we are also looking for ways to leverage memory profile for
> PGO/FDO. I think having a common profiling infrastructure for analysis
> tooling as well as profile guided optimizations is good design, and having
> it in LLVM is also helpful. Very interested in the tooling and optimization
> that comes after the profiler.
>
>
>
> Two questions:
>
>    - How does the profiling overhead look? Is that similar to ASAN
>    overhead from what you’ve seen, which would be higher than PGO
>    instrumentation? Asking because I’m wondering if any PGO training setup
can
>    be used directly for the new heap profiling.
>
>It is built on top of ASAN runtime, but the overhead can be made much lower
by using counter update consolidation -- all fields sharing the same shadow
counter can be merged, and aggressive loop sinking/hoisting can be done.

The goal is to integrate this with the PGO instrumentation. The PGO
instrumentation overhead can be further reduced with sampling technique
(Rong Xu has a patch to be submitted).

>    -
>    - I’m not familiar with how sanitizer handles stack trace, but for
>    getting most accurate calling context (use FP rather than dwarf), I
guess
>    frame pointer omission and tail call opt etc. need to be turned off? Is
>    that going to be implied by -fheapprof?
>
>Kostya can provide detailed answers to these questions.

David
>
>    -
>
>
>
> Thanks,
>
> Wenlei
>
>
>
> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
Teresa
> Johnson via llvm-dev <llvm-dev at lists.llvm.org>
> *Reply-To: *Teresa Johnson <tejohnson at google.com>
> *Date: *Wednesday, June 24, 2020 at 4:58 PM
> *To: *llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany <
> kcc at google.com>, Evgenii Stepanov <eugenis at google.com>,
Vitaly Buka <
> vitalybuka at google.com>
> *Cc: *David Li <davidxl at google.com>
> *Subject: *[llvm-dev] RFC: Sanitizer-based Heap Profiler
>
>
>
> Hi all,
>
>
>
> I've included an RFC for a heap profiler design I've been working
on in
> conjunction with David Li. Please send any questions or feedback. For
> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
> shadow setup code (see the Shadow Memory section).
>
>
>
> Thanks,
>
> Teresa
>
>
>
> RFC: Sanitizer-based Heap Profiler
> Summary
>
> This document provides an overview of an LLVM Sanitizer-based heap
> profiler design.
> Motivation
>
> The objective of heap memory profiling is to collect critical runtime
> information associated with heap memory references and information on heap
> memory allocations. The profile information will be used first for tooling,
> and subsequently to guide the compiler optimizer and allocation runtime to
> layout heap objects with improved spatial locality. As a  result, DTLB and
> cache utilization will be improved, and program IPC (performance) will be
> increased due to reduced TLB and cache misses. More details on the heap
> profile guided optimizations will be shared in the future.
> Overview
>
> The profiler is based on compiler inserted instrumentation of load and
> store accesses, and utilizes runtime support to monitor heap allocations
> and profile data. The target consumer of the heap memory profile
> information is initially tooling and ultimately automatic data layout
> optimizations performed by the compiler and/or allocation runtime (with the
> support of new allocation runtime APIs).
>
>
>
> Each memory address is mapped to Shadow Memory
>
<https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Shadow-5Fmemory&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=KfYo542rDdZQGClmgz-RBw&m=f45oT3WLypO1yblv9KNkPd-rl8jlBp761Hhvev27S8M&s=iIirMZSYnDlGIjY8PZjJprWckHx7QhmKUQKcb1URBFY&e=>,
> similar to the approach used by the Address Sanitizer
> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
> location implements the profile counter (incremented on accesses to the
> corresponding memory). This granularity was chosen to help avoid counter
> overflow, but it may be possible to consider mapping 32-bytes to 4 bytes.
> To avoid aliasing of shadow memory for different allocations, we must
> choose a minimum alignment carefully. As discussed further below, we can
> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
> storing necessary heap information for each allocation in a 32-byte header
> block.
>
>
>
> The compiler instruments each load and store to increment the associated
> shadow memory counter, in order to determine hotness.
>
>
>
> The heap profiler runtime is responsible for tracking allocations and
> deallocations, including the stack at each allocation, and information such
> as the allocation size and other statistics. I have implemented a prototype
> built using a stripped down and modified version of ASAN, however this will
> be a separate library utilizing sanitizer_common components.
> Compiler
>
> A simple HeapProfiler instrumentation pass instruments interesting memory
> accesses (loads, stores, atomics), with a simple load, increment, store of
> the associated shadow memory location (computed via a mask and shift to do
> the mapping of 64 bytes to 8 byte shadow, and add of the shadow offset).
> The handling is very similar to and based off of the ASAN instrumentation
> pass, with slightly different instrumentation code.
>
>
>
> Various techniques can be used to reduce the overhead, by aggressively
> coalescing counter updates (e.g. given the 32-byte alignment, accesses
> known to be in the same 32-byte block, or across possible aliases since we
> don’t care about the dereferenced values).
>
>
>
> Additionally, the Clang driver needs to set up to link with the runtime
> library, much as it does with the sanitizers.
>
>
>
> A -fheapprof option is added to enable the instrumentation pass and
> runtime library linking. Similar to -fprofile-generate, -fheapprof will
> accept an argument specifying the directory in which to write the profile.
> Runtime
>
> The heap profiler runtime is responsible for tracking and reporting
> information about heap allocations and accesses, aggregated by allocation
> calling context. For example, the hotness, lifetime, and cpu affinity.
>
>
>
> A new heapprof library will be created within compiler-rt. It will
> leverage support within sanitizer_common, which already contains facilities
> like stack context tracking, needed by the heap profiler.
> Shadow Memory
>
> There are some basic facilities in sanitizer_common for mmap’ing the
> shadow memory, but most of the existing setup lives in the ASAN and HWASAN
> libraries. In the case of ASAN, there is support for both statically
> assigned shadow offsets (the default on most platforms), and for
> dynamically assigned shadow memory (implemented for Windows and currently
> also used for Android and iOS). According to kcc, recent experiments show
> that the performance with a dynamic shadow is close to that with a static
> mapping. In fact, that is the only approach currently used by HWASAN. Given
> the simplicity, the heap profiler will be implemented with a dynamic shadow
> as well.
>
>
>
> There are a number of functions in ASAN and HWASAN related to setup of the
> shadow that are duplicated but very nearly identical, at least for linux
> (which seems to be the only OS flavor currently supported for HWASAN). E.g.
> ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN
> there is another nearly identical copy in PremapShadow, used by Android,
> whereas in HW ASAN the premap handling is already commoned with the
> non-premap handling). Rather than make yet another copy of these
> mechanisms, I propose refactoring them into sanitizer_common versions. Like
> HWASAN, the initial version of the heap profiler will be supported for
> linux only, but other OSes can be added as needed similar to ASAN.
> StackTrace and StackDepot
>
> The sanitizer already contains support for obtaining and representing a
> stack trace in a StackTrace object, and storing it in the StackDepot which
> “efficiently stores huge amounts of stack traces”. This is in the
> sanitizer_common subdirectory and the support is shared by ASAN and
> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
> where each StackTrace is assigned a unique id. ASAN stores this id in the
> alloc_context_id field in each ChunkHeader (in the redzone preceding each
> allocation). Additionally, there is support for symbolizing and printing
> StackTrace objects.
> ChunkHeader
>
> The heap profiler needs to track several pieces of information for each
> allocation. Given the mapping of 64-bytes to 8-bytes shadow, we can achieve
> a minimum of 32-byte alignment by holding this information in a 32-byte
> header block preceding each allocation.
>
>
>
> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains
> information about the current allocation state, user requested size,
> allocation and free thread ids, the allocation context id (representing the
> call stack at allocation, assigned by the StackDepot as described above),
> and misc other bookkeeping. For heap profiling, this will be converted to a
> 32-byte header block.
>
>
>
> Note that we could instead use the metadata section, similar to other
> sanitizers, which is stored in a separate location. However, as described
> above, storing the header block with each allocation enables 32-byte
> alignment without aliasing shadow counters for the same 64 bytes of memory.
>
>
>
> In the prototype heap profiler implementation, the header contains the
> following fields:
>
>
>
> // Should be 32 bytes
>
> struct ChunkHeader {
>
>   // 1-st 4 bytes
>
>   // Carry over from ASAN (available, allocated, quarantined). Will be
>
>   // reduced to 1 bit (available or allocated).
>
>   u32 chunk_state       : 8;
>
>   // Carry over from ASAN. Used to determine the start of user allocation.
>
>   u32 from_memalign     : 1;
>
>   // 23 bits available
>
>
>
>   // 2-nd 4 bytes
>
>   // Carry over from ASAN (comment copied verbatim).
>
>   // This field is used for small sizes. For large sizes it is equal to
>
>   // SizeClassMap::kMaxSize and the actual size is stored in the
>
>   // SecondaryAllocator's metadata.
>
>   u32 user_requested_size : 29;
>
>
>
>   // 3-rd 4 bytes
>
>   u32 cpu_id; // Allocation cpu id
>
>
>
>   // 4-th 4 bytes
>
>   // Allocation timestamp in ms from a baseline timestamp computed at
>
>   // the start of profiling (to keep this within 32 bits).
>
>   u32 timestamp_ms;
>
>
>
>   // 5-th and 6-th 4 bytes
>
>   // Carry over from ASAN. Used to identify allocation stack trace.
>
>   u64 alloc_context_id;
>
>
>
>   // 7-th and 8-th 4 bytes
>
>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>
>   u64 data_type_id; // hash of type name
>
> };
>
> As noted, the chunk state can be reduced to a single bit (no need for
> quarantined memory in the heap profiler). The header contains a placeholder
> for the data type hash, which is not yet implemented as it needs
> instrumentation and IR support.
> Heap Info Block (HIB)
>
> On a deallocation, information from the corresponding shadow block(s) and
> header are recorded in a Heap Info Block (HIB) object. The access count is
> computed from the shadow memory locations for the allocation, as well as
> the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero
> 8-byte shadow locations for the whole allocation). Other information such
> as the deallocation timestamp (for lifetime computation) and deallocation
> cpu id (to determine migrations) are recorded along with the information in
> the chunk header recorded on allocation.
>
>
>
> The prototyped HIB object tracks the following:
>
>
>
> struct HeapInfoBlock {
>
>   // Total allocations at this stack context
>
>   u32 alloc_count;
>
>   // Access count computed from all allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max).
>
>   u64 total_access_count, min_access_count, max_access_count;
>
>   // Allocated size (track total across all allocations, and the min and
> max).
>
>   u64 total_size;
>
>   u32 min_size, max_size;
>
>   // Lifetime (track total across all allocations, and the min and max).
>
>   u64 total_lifetime;
>
>   u32 min_lifetime, max_lifetime;
>
>   // Percent utilization of allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max). The utilization is
>
>   // defined as the percentage of 8-byte shadow counters corresponding to
>
>   // the full allocation that are non-zero.
>
>   u64 total_percent_utilized;
>
>   u32 min_percent_utilized, max_percent_utilized;
>
>   // Allocation and deallocation timestamps from the most recent merge into
>
>   // the table with this stack context.
>
>   u32 alloc_timestamp, dealloc_timestamp;
>
>   // Allocation and deallocation cpu ids from the most recent merge into
>
>   // the table with this stack context.
>
>   u32 alloc_cpu_id, dealloc_cpu_id;
>
>   // Count of allocations at this stack context that had a different
>
>   // allocation and deallocation cpu id.
>
>   u32 num_migrated_cpu;
>
>   // Number of times the lifetime of the entry being merged had its
> lifetime
>
>   // overlap with the previous entry merged with this stack context (by
>
>   // comparing the new alloc/dealloc timestamp with the one last recorded
> in
>
>   // the entry in the table.
>
>   u32 num_lifetime_overlaps;
>
>   // Number of times the alloc/dealloc cpu of the entry being merged was
> the
>
>   // same as that of the previous entry merged with this stack context
>
>   u32 num_same_alloc_cpu;
>
>   u32 num_same_dealloc_cpu;
>
>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation support
> and
>
>   // possibly IR changes.
>
>   u64 data_type_id;
>
> }
> HIB Table
>
> The Heap Info Block Table, which is a multi-way associative cache, holds
> HIB objects from deallocated objects. It is indexed by the stack allocation
> context id from the chunk header, and currently utilizes a simple mod with
> a prime number close to a power of two as the hash (because of the way the
> stack context ids are assigned, a mod of a power of two performs very
> poorly). Thus far, only 4-way associativity has been evaluated.
>
>
>
> HIB entries are added or merged into the HIB Table on each deallocation.
> If an entry with a matching stack alloc context id is found in the Table,
> the newly deallocated information is merged into the existing entry. Each
> HIB Table entry currently tracks the min, max and total value of the
> various fields for use in computing and reporting the min, max and average
> when the Table is ultimately dumped.
>
>
>
> If no entry with a matching stack alloc context id is found, a new entry
> is created. If this causes an eviction, the evicted entry is dumped
> immediately (by default to stderr, otherwise to a specified report file).
> Later post processing can merge dumped entries with the same stack alloc
> context id.
> Initialization
>
>
>
> For ASAN, an __asan_init function initializes the memory allocation
> tracking support, and the ASAN instrumentation pass in LLVM creates a
> global constructor to invoke it. The heap profiler prototype adds a new
> __heapprof_init function, which performs heap profile specific
> initialization, and the heap profile instrumentation pass calls this new
> init function instead by a generated global constructor. It currently
> additionally invokes __asan_init since we are leveraging a modified ASAN
> runtime. Eventually, this should be changed to initialize refactored common
> support.
>
>
>
> Note that __asan init is also placed in the .preinit_array when it is
> available, so it is invoked even earlier than global constructors.
> Currently, it is not possible to do this for __heapprof_init, as it calls
> timespec_get in order to get a baseline timestamp (as described in the
> ChunkHeader comments the timestamps (ms) are actually offsets from the
> baseline timestamp, in order to fit into 32 bits), and system calls cannot
> be made that early (dl_init is not complete). Since the constructor
> priority is 1, it should be executed early enough that there are very few
> allocations before it runs, and likely the best solution is to simply
> ignore any allocations before initialization.
> Dumping
>
> For the prototype, the profile is dumped as text with a compact raw format
> to limit its size. Ultimately it should be dumped in a more compact binary
> format (i.e. into a different section of the raw instrumentation based
> profile, with llvm-profdata performing post-processing) which is TBD.
> HIB Dumping
>
> As noted earlier, HIB Table entries are created as memory is deallocated.
> At the end of the run (or whenever dumping is requested, discussed later),
> HIB entries need to be created for allocations that are still live.
> Conveniently, the sanitizer allocator already contains a mechanism to walk
> through all chunks of memory it is tracking (ForEachChunk). The heap
> profiler simply looks for all chunks with a chunk state of allocated, and
> creates a HIB the same as would be done on deallocation, adding each to the
> table.
>
>
>
> A HIB Table mechanism for printing each entry is then invoked.
>
>
>
> By default, the dumping occurs:
>
>    - on evictions
>    - full table at exit (when the static Allocator object is destructed)
>
>
>
> For running in a load testing scenario, we will want to add a mechanism to
> provoke finalization (merging currently live allocations) and dumping of
> the HIB Table before exit. This would be similar to the __llvm_profile_dump
> facility used for normal PGO counter dumping.
> Stack Trace Dumping
>
> There is existing support for dumping symbolized StackTrace objects. A
> wrapper to dump all StackTrace objects in the StackDepot will be added.
> This new interface is invoked just after the HIB Table is dumped (on exit
> or via dumping interface).
> Memory Map Dumping
>
> In cases where we may want to symbolize as a post processing step, we may
> need the memory map (from /proc/self/smaps). Specifically, this is needed
> to symbolize binaries using ASLR (Address Space Layout Randomization).
> There is already support for reading this file and dumping it to the
> specified report output file (DumpProcessMap()). This is invoked when the
> profile output file is initialized (HIB Table construction), so that the
> memory map is available at the top of the raw profile.
> Current Status and Next Steps
>
>
>
> As mentioned earlier, I have a working prototype based on a simplified
> stripped down version of ASAN. My current plan is to do the following:
>
>    1. Refactor out some of the shadow setup code common between ASAN and
>    HWASAN into sanitizer_common.
>    2. Rework my prototype into a separate heapprof library in
>    compiler-rt, using sanitizer_common support where possible, and send
>    patches for review.
>    3. Send patches for the heap profiler instrumentation pass and related
>    clang options.
>    4. Design/implement binary profile format
>
>
>
> --
>
> Teresa Johnson |
>
>  Software Engineer |
>
>  tejohnson at google.com |
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200705/51c0a31c/attachment-0001.html>

Teresa Johnson via llvm-dev

2020-Jul-06 19:08 UTC

head link

[llvm-dev] RFC: Sanitizer-based Heap Profiler

On Wed, Jun 24, 2020 at 4:57 PM Teresa Johnson <tejohnson at google.com>
wrote:

...
> Shadow Memory
>
> There are some basic facilities in sanitizer_common for mmap’ing the
> shadow memory, but most of the existing setup lives in the ASAN and HWASAN
> libraries. In the case of ASAN, there is support for both statically
> assigned shadow offsets (the default on most platforms), and for
> dynamically assigned shadow memory (implemented for Windows and currently
> also used for Android and iOS). According to kcc, recent experiments show
> that the performance with a dynamic shadow is close to that with a static
> mapping. In fact, that is the only approach currently used by HWASAN. Given
> the simplicity, the heap profiler will be implemented with a dynamic shadow
> as well.
>
> There are a number of functions in ASAN and HWASAN related to setup of the
> shadow that are duplicated but very nearly identical, at least for linux
> (which seems to be the only OS flavor currently supported for HWASAN). E.g.
> ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN
> there is another nearly identical copy in PremapShadow, used by Android,
> whereas in HW ASAN the premap handling is already commoned with the
> non-premap handling). Rather than make yet another copy of these
> mechanisms, I propose refactoring them into sanitizer_common versions. Like
> HWASAN, the initial version of the heap profiler will be supported for
> linux only, but other OSes can be added as needed similar to ASAN.
>
FYI I just mailed D83247 for this refactoring.



Teresa Johnson |  Software Engineer |  tejohnson at google.com
|>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200706/6b92a659/attachment.html>

Kostya Serebryany via llvm-dev

2020-Jul-09 01:30 UTC

head link

[llvm-dev] RFC: Sanitizer-based Heap Profiler

On Wed, Jun 24, 2020 at 4:58 PM Teresa Johnson <tejohnson at google.com>
wrote:
> Hi all,
>
> I've included an RFC for a heap profiler design I've been working
on in
> conjunction with David Li. Please send any questions or feedback. For
> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
> shadow setup code (see the Shadow Memory section).
>
I have no strong opinion on the refactoring of the shadow set up code.
Getting such code in one common place may or may not make the future
maintenance cheaper.

One question about granularity: tcmalloc's allocation granularity is 8, so
by using 64- or 32- byte granules you lose some of the precision.
How important is that?
Another option is to use 8- or 16-byte granularity and 4-byte (saturated?)
counters.
Lots of options to play with -- which is why my recommendation is to start
with a callback-based instrumentation (just use tsan compiler pass for
that),
figure out exactly what you need and then either implement the one best
instrumentation inline, or force the callbacks to be inlined with
(Thin)LTO.

OTOH, with smaller granularity you lose most of the "counter update
consolidation" optimization.

Eagerly looking forward to using this profiler!

--kcc
>
> Thanks,
> Teresa
>
>
> RFC: Sanitizer-based Heap Profiler
> Summary
>
> This document provides an overview of an LLVM Sanitizer-based heap
> profiler design.
> Motivation
>
> The objective of heap memory profiling is to collect critical runtime
> information associated with heap memory references and information on heap
> memory allocations. The profile information will be used first for tooling,
> and subsequently to guide the compiler optimizer and allocation runtime to
> layout heap objects with improved spatial locality. As a  result, DTLB and
> cache utilization will be improved, and program IPC (performance) will be
> increased due to reduced TLB and cache misses. More details on the heap
> profile guided optimizations will be shared in the future.
> Overview
>
> The profiler is based on compiler inserted instrumentation of load and
> store accesses, and utilizes runtime support to monitor heap allocations
> and profile data. The target consumer of the heap memory profile
> information is initially tooling and ultimately automatic data layout
> optimizations performed by the compiler and/or allocation runtime (with the
> support of new allocation runtime APIs).
>
> Each memory address is mapped to Shadow Memory
> <https://en.wikipedia.org/wiki/Shadow_memory>, similar to the
approach
> used by the Address Sanitizer
> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
> location implements the profile counter (incremented on accesses to the
> corresponding memory). This granularity was chosen to help avoid counter
> overflow, but it may be possible to consider mapping 32-bytes to 4 bytes.
> To avoid aliasing of shadow memory for different allocations, we must
> choose a minimum alignment carefully. As discussed further below, we can
> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
> storing necessary heap information for each allocation in a 32-byte header
> block.
>
> The compiler instruments each load and store to increment the associated
> shadow memory counter, in order to determine hotness.
>
> The heap profiler runtime is responsible for tracking allocations and
> deallocations, including the stack at each allocation, and information such
> as the allocation size and other statistics. I have implemented a prototype
> built using a stripped down and modified version of ASAN, however this will
> be a separate library utilizing sanitizer_common components.
> Compiler
>
> A simple HeapProfiler instrumentation pass instruments interesting memory
> accesses (loads, stores, atomics), with a simple load, increment, store of
> the associated shadow memory location (computed via a mask and shift to do
> the mapping of 64 bytes to 8 byte shadow, and add of the shadow offset).
> The handling is very similar to and based off of the ASAN instrumentation
> pass, with slightly different instrumentation code.
>
> Various techniques can be used to reduce the overhead, by aggressively
> coalescing counter updates (e.g. given the 32-byte alignment, accesses
> known to be in the same 32-byte block, or across possible aliases since we
> don’t care about the dereferenced values).
>
> Additionally, the Clang driver needs to set up to link with the runtime
> library, much as it does with the sanitizers.
>
> A -fheapprof option is added to enable the instrumentation pass and
> runtime library linking. Similar to -fprofile-generate, -fheapprof will
> accept an argument specifying the directory in which to write the profile.
> Runtime
>
> The heap profiler runtime is responsible for tracking and reporting
> information about heap allocations and accesses, aggregated by allocation
> calling context. For example, the hotness, lifetime, and cpu affinity.
>
> A new heapprof library will be created within compiler-rt. It will
> leverage support within sanitizer_common, which already contains facilities
> like stack context tracking, needed by the heap profiler.
> Shadow Memory
>
> There are some basic facilities in sanitizer_common for mmap’ing the
> shadow memory, but most of the existing setup lives in the ASAN and HWASAN
> libraries. In the case of ASAN, there is support for both statically
> assigned shadow offsets (the default on most platforms), and for
> dynamically assigned shadow memory (implemented for Windows and currently
> also used for Android and iOS). According to kcc, recent experiments show
> that the performance with a dynamic shadow is close to that with a static
> mapping. In fact, that is the only approach currently used by HWASAN. Given
> the simplicity, the heap profiler will be implemented with a dynamic shadow
> as well.
>
> There are a number of functions in ASAN and HWASAN related to setup of the
> shadow that are duplicated but very nearly identical, at least for linux
> (which seems to be the only OS flavor currently supported for HWASAN). E.g.
> ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN
> there is another nearly identical copy in PremapShadow, used by Android,
> whereas in HW ASAN the premap handling is already commoned with the
> non-premap handling). Rather than make yet another copy of these
> mechanisms, I propose refactoring them into sanitizer_common versions. Like
> HWASAN, the initial version of the heap profiler will be supported for
> linux only, but other OSes can be added as needed similar to ASAN.
> StackTrace and StackDepot
>
> The sanitizer already contains support for obtaining and representing a
> stack trace in a StackTrace object, and storing it in the StackDepot which
> “efficiently stores huge amounts of stack traces”. This is in the
> sanitizer_common subdirectory and the support is shared by ASAN and
> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
> where each StackTrace is assigned a unique id. ASAN stores this id in the
> alloc_context_id field in each ChunkHeader (in the redzone preceding each
> allocation). Additionally, there is support for symbolizing and printing
> StackTrace objects.
> ChunkHeader
>
> The heap profiler needs to track several pieces of information for each
> allocation. Given the mapping of 64-bytes to 8-bytes shadow, we can achieve
> a minimum of 32-byte alignment by holding this information in a 32-byte
> header block preceding each allocation.
>
> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains
> information about the current allocation state, user requested size,
> allocation and free thread ids, the allocation context id (representing the
> call stack at allocation, assigned by the StackDepot as described above),
> and misc other bookkeeping. For heap profiling, this will be converted to a
> 32-byte header block.
>
> Note that we could instead use the metadata section, similar to other
> sanitizers, which is stored in a separate location. However, as described
> above, storing the header block with each allocation enables 32-byte
> alignment without aliasing shadow counters for the same 64 bytes of memory.
>
> In the prototype heap profiler implementation, the header contains the
> following fields:
>
> // Should be 32 bytes
>
> struct ChunkHeader {
>
>   // 1-st 4 bytes
>
>   // Carry over from ASAN (available, allocated, quarantined). Will be
>
>   // reduced to 1 bit (available or allocated).
>
>   u32 chunk_state       : 8;
>
>   // Carry over from ASAN. Used to determine the start of user allocation.
>
>   u32 from_memalign     : 1;
>
>   // 23 bits available
>
>   // 2-nd 4 bytes
>
>   // Carry over from ASAN (comment copied verbatim).
>
>   // This field is used for small sizes. For large sizes it is equal to
>
>   // SizeClassMap::kMaxSize and the actual size is stored in the
>
>   // SecondaryAllocator's metadata.
>
>   u32 user_requested_size : 29;
>
>   // 3-rd 4 bytes
>
>   u32 cpu_id; // Allocation cpu id
>
>   // 4-th 4 bytes
>
>   // Allocation timestamp in ms from a baseline timestamp computed at
>
>   // the start of profiling (to keep this within 32 bits).
>
>   u32 timestamp_ms;
>
>   // 5-th and 6-th 4 bytes
>
>   // Carry over from ASAN. Used to identify allocation stack trace.
>
>   u64 alloc_context_id;
>
>   // 7-th and 8-th 4 bytes
>
>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>
>   u64 data_type_id; // hash of type name
>
> };
>
> As noted, the chunk state can be reduced to a single bit (no need for
> quarantined memory in the heap profiler). The header contains a placeholder
> for the data type hash, which is not yet implemented as it needs
> instrumentation and IR support.
> Heap Info Block (HIB)
>
> On a deallocation, information from the corresponding shadow block(s) and
> header are recorded in a Heap Info Block (HIB) object. The access count is
> computed from the shadow memory locations for the allocation, as well as
> the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero
> 8-byte shadow locations for the whole allocation). Other information such
> as the deallocation timestamp (for lifetime computation) and deallocation
> cpu id (to determine migrations) are recorded along with the information in
> the chunk header recorded on allocation.
>
> The prototyped HIB object tracks the following:
>
> struct HeapInfoBlock {
>
>   // Total allocations at this stack context
>
>   u32 alloc_count;
>
>   // Access count computed from all allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max).
>
>   u64 total_access_count, min_access_count, max_access_count;
>
>   // Allocated size (track total across all allocations, and the min and
> max).
>
>   u64 total_size;
>
>   u32 min_size, max_size;
>
>   // Lifetime (track total across all allocations, and the min and max).
>
>   u64 total_lifetime;
>
>   u32 min_lifetime, max_lifetime;
>
>   // Percent utilization of allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max). The utilization is
>
>   // defined as the percentage of 8-byte shadow counters corresponding to
>
>   // the full allocation that are non-zero.
>
>   u64 total_percent_utilized;
>
>   u32 min_percent_utilized, max_percent_utilized;
>
>   // Allocation and deallocation timestamps from the most recent merge into
>
>   // the table with this stack context.
>
>   u32 alloc_timestamp, dealloc_timestamp;
>
>   // Allocation and deallocation cpu ids from the most recent merge into
>
>   // the table with this stack context.
>
>   u32 alloc_cpu_id, dealloc_cpu_id;
>
>   // Count of allocations at this stack context that had a different
>
>   // allocation and deallocation cpu id.
>
>   u32 num_migrated_cpu;
>
>   // Number of times the lifetime of the entry being merged had its
> lifetime
>
>   // overlap with the previous entry merged with this stack context (by
>
>   // comparing the new alloc/dealloc timestamp with the one last recorded
> in
>
>   // the entry in the table.
>
>   u32 num_lifetime_overlaps;
>
>   // Number of times the alloc/dealloc cpu of the entry being merged was
> the
>
>   // same as that of the previous entry merged with this stack context
>
>   u32 num_same_alloc_cpu;
>
>   u32 num_same_dealloc_cpu;
>
>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation support
> and
>
>   // possibly IR changes.
>
>   u64 data_type_id;
>
> }
> HIB Table
>
> The Heap Info Block Table, which is a multi-way associative cache, holds
> HIB objects from deallocated objects. It is indexed by the stack allocation
> context id from the chunk header, and currently utilizes a simple mod with
> a prime number close to a power of two as the hash (because of the way the
> stack context ids are assigned, a mod of a power of two performs very
> poorly). Thus far, only 4-way associativity has been evaluated.
>
> HIB entries are added or merged into the HIB Table on each deallocation.
> If an entry with a matching stack alloc context id is found in the Table,
> the newly deallocated information is merged into the existing entry. Each
> HIB Table entry currently tracks the min, max and total value of the
> various fields for use in computing and reporting the min, max and average
> when the Table is ultimately dumped.
>
> If no entry with a matching stack alloc context id is found, a new entry
> is created. If this causes an eviction, the evicted entry is dumped
> immediately (by default to stderr, otherwise to a specified report file).
> Later post processing can merge dumped entries with the same stack alloc
> context id.
> Initialization
>
> For ASAN, an __asan_init function initializes the memory allocation
> tracking support, and the ASAN instrumentation pass in LLVM creates a
> global constructor to invoke it. The heap profiler prototype adds a new
> __heapprof_init function, which performs heap profile specific
> initialization, and the heap profile instrumentation pass calls this new
> init function instead by a generated global constructor. It currently
> additionally invokes __asan_init since we are leveraging a modified ASAN
> runtime. Eventually, this should be changed to initialize refactored common
> support.
>
> Note that __asan init is also placed in the .preinit_array when it is
> available, so it is invoked even earlier than global constructors.
> Currently, it is not possible to do this for __heapprof_init, as it calls
> timespec_get in order to get a baseline timestamp (as described in the
> ChunkHeader comments the timestamps (ms) are actually offsets from the
> baseline timestamp, in order to fit into 32 bits), and system calls cannot
> be made that early (dl_init is not complete). Since the constructor
> priority is 1, it should be executed early enough that there are very few
> allocations before it runs, and likely the best solution is to simply
> ignore any allocations before initialization.
> Dumping
>
> For the prototype, the profile is dumped as text with a compact raw format
> to limit its size. Ultimately it should be dumped in a more compact binary
> format (i.e. into a different section of the raw instrumentation based
> profile, with llvm-profdata performing post-processing) which is TBD.
> HIB Dumping
>
> As noted earlier, HIB Table entries are created as memory is deallocated.
> At the end of the run (or whenever dumping is requested, discussed later),
> HIB entries need to be created for allocations that are still live.
> Conveniently, the sanitizer allocator already contains a mechanism to walk
> through all chunks of memory it is tracking (ForEachChunk). The heap
> profiler simply looks for all chunks with a chunk state of allocated, and
> creates a HIB the same as would be done on deallocation, adding each to the
> table.
>
> A HIB Table mechanism for printing each entry is then invoked.
>
> By default, the dumping occurs:
>
>    -
>
>    on evictions
>    -
>
>    full table at exit (when the static Allocator object is destructed)
>
>
> For running in a load testing scenario, we will want to add a mechanism to
> provoke finalization (merging currently live allocations) and dumping of
> the HIB Table before exit. This would be similar to the __llvm_profile_dump
> facility used for normal PGO counter dumping.
> Stack Trace Dumping
>
> There is existing support for dumping symbolized StackTrace objects. A
> wrapper to dump all StackTrace objects in the StackDepot will be added.
> This new interface is invoked just after the HIB Table is dumped (on exit
> or via dumping interface).
> Memory Map Dumping
>
> In cases where we may want to symbolize as a post processing step, we may
> need the memory map (from /proc/self/smaps). Specifically, this is needed
> to symbolize binaries using ASLR (Address Space Layout Randomization).
> There is already support for reading this file and dumping it to the
> specified report output file (DumpProcessMap()). This is invoked when the
> profile output file is initialized (HIB Table construction), so that the
> memory map is available at the top of the raw profile.
> Current Status and Next Steps
>
> As mentioned earlier, I have a working prototype based on a simplified
> stripped down version of ASAN. My current plan is to do the following:
>
>    1.
>
>    Refactor out some of the shadow setup code common between ASAN and
>    HWASAN into sanitizer_common.
>    2.
>
>    Rework my prototype into a separate heapprof library in compiler-rt,
>    using sanitizer_common support where possible, and send patches for
review.
>    3.
>
>    Send patches for the heap profiler instrumentation pass and related
>    clang options.
>    4.
>
>    Design/implement binary profile format
>
>
> --
> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200708/767a2d8d/attachment.html>

Teresa Johnson via llvm-dev

2020-Jul-09 19:36 UTC

head link

[llvm-dev] RFC: Sanitizer-based Heap Profiler

On Wed, Jul 8, 2020 at 6:30 PM Kostya Serebryany <kcc at google.com>
wrote:
>
>
> On Wed, Jun 24, 2020 at 4:58 PM Teresa Johnson <tejohnson at
google.com>
> wrote:
>
>> Hi all,
>>
>> I've included an RFC for a heap profiler design I've been
working on in
>> conjunction with David Li. Please send any questions or feedback. For
>> sanitizer folks, one area of feedback is on refactoring some of the
*ASAN
>> shadow setup code (see the Shadow Memory section).
>>
>
> I have no strong opinion on the refactoring of the shadow set up code.
> Getting such code in one common place may or may not make the future
> maintenance cheaper.
>
It was partially motivated by my looking to see what version of shadow
setup to use, as they initially looked somewhat different, only to discover
after carefully tracing through all the values that they were essentially
identical but just structured differently (at least between asan and
hwasan). This is somewhat philosophical, but to me it seems better to
refactor when possible.
>
> One question about granularity: tcmalloc's allocation granularity is 8,
so
> by using 64- or 32- byte granules you lose some of the precision.
>
It depends what you mean by precision. For a single allocation, I end up
accumulating the access counts across all corresponding shadows (i.e.
if>64 bytes). Although I also added in a measure of utilization, which is atthe less precise 64-byte granularity. I'm aligning everything to 32 bytes,
which with the 32-byte header means there is no aliasing of shadows between
different allocations. That being said...

> How important is that?
> Another option is to use 8- or 16-byte granularity and 4-byte (saturated?)
> counters.
>
We've discussed using smaller granularities with saturating math, along
with sampling to reduce the saturation.

> Lots of options to play with -- which is why my recommendation is to start
> with a callback-based instrumentation (just use tsan compiler pass for
> that),
> figure out exactly what you need and then either implement the one best
> instrumentation inline, or force the callbacks to be inlined with
> (Thin)LTO.
>
I'm not convinced yet that using callbacks makes this difficult to play
with, as it can be configured pretty easily from options, and the inline
instrumentation part has been pretty trivial thus far. We'd also have to
build in the counter update consolidation in either case (either with less
counter updates or with fewer/aggregated callbacks).

> OTOH, with smaller granularity you lose most of the "counter update
> consolidation" optimization.
>
Some of it (at least where we would consolidate updates to different fields
using alignment guarantees). But we'd still be able to aggregate across
potentially aliasing calls, including out of loop bodies, where the updates
are to the same field.

>
> Eagerly looking forward to using this profiler!
>
Thanks!
Teresa

>
> --kcc
>
>>
>> Thanks,
>> Teresa
>>
>>
>> RFC: Sanitizer-based Heap Profiler
>> Summary
>>
>> This document provides an overview of an LLVM Sanitizer-based heap
>> profiler design.
>> Motivation
>>
>> The objective of heap memory profiling is to collect critical runtime
>> information associated with heap memory references and information on
heap
>> memory allocations. The profile information will be used first for
tooling,
>> and subsequently to guide the compiler optimizer and allocation runtime
to
>> layout heap objects with improved spatial locality. As a  result, DTLB
and
>> cache utilization will be improved, and program IPC (performance) will
be
>> increased due to reduced TLB and cache misses. More details on the heap
>> profile guided optimizations will be shared in the future.
>> Overview
>>
>> The profiler is based on compiler inserted instrumentation of load and
>> store accesses, and utilizes runtime support to monitor heap
allocations
>> and profile data. The target consumer of the heap memory profile
>> information is initially tooling and ultimately automatic data layout
>> optimizations performed by the compiler and/or allocation runtime (with
the
>> support of new allocation runtime APIs).
>>
>> Each memory address is mapped to Shadow Memory
>> <https://en.wikipedia.org/wiki/Shadow_memory>, similar to the
approach
>> used by the Address Sanitizer
>> <https://github.com/google/sanitizers/wiki/AddressSanitizer>
(ASAN).
>> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
>> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
>> location implements the profile counter (incremented on accesses to the
>> corresponding memory). This granularity was chosen to help avoid
counter
>> overflow, but it may be possible to consider mapping 32-bytes to 4
bytes.
>> To avoid aliasing of shadow memory for different allocations, we must
>> choose a minimum alignment carefully. As discussed further below, we
can
>> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
>> storing necessary heap information for each allocation in a 32-byte
header
>> block.
>>
>> The compiler instruments each load and store to increment the
associated
>> shadow memory counter, in order to determine hotness.
>>
>> The heap profiler runtime is responsible for tracking allocations and
>> deallocations, including the stack at each allocation, and information
such
>> as the allocation size and other statistics. I have implemented a
prototype
>> built using a stripped down and modified version of ASAN, however this
will
>> be a separate library utilizing sanitizer_common components.
>> Compiler
>>
>> A simple HeapProfiler instrumentation pass instruments interesting
memory
>> accesses (loads, stores, atomics), with a simple load, increment, store
of
>> the associated shadow memory location (computed via a mask and shift to
do
>> the mapping of 64 bytes to 8 byte shadow, and add of the shadow
offset).
>> The handling is very similar to and based off of the ASAN
instrumentation
>> pass, with slightly different instrumentation code.
>>
>> Various techniques can be used to reduce the overhead, by aggressively
>> coalescing counter updates (e.g. given the 32-byte alignment, accesses
>> known to be in the same 32-byte block, or across possible aliases since
we
>> don’t care about the dereferenced values).
>>
>> Additionally, the Clang driver needs to set up to link with the runtime
>> library, much as it does with the sanitizers.
>>
>> A -fheapprof option is added to enable the instrumentation pass and
>> runtime library linking. Similar to -fprofile-generate, -fheapprof will
>> accept an argument specifying the directory in which to write the
profile.
>> Runtime
>>
>> The heap profiler runtime is responsible for tracking and reporting
>> information about heap allocations and accesses, aggregated by
allocation
>> calling context. For example, the hotness, lifetime, and cpu affinity.
>>
>> A new heapprof library will be created within compiler-rt. It will
>> leverage support within sanitizer_common, which already contains
facilities
>> like stack context tracking, needed by the heap profiler.
>> Shadow Memory
>>
>> There are some basic facilities in sanitizer_common for mmap’ing the
>> shadow memory, but most of the existing setup lives in the ASAN and
HWASAN
>> libraries. In the case of ASAN, there is support for both statically
>> assigned shadow offsets (the default on most platforms), and for
>> dynamically assigned shadow memory (implemented for Windows and
currently
>> also used for Android and iOS). According to kcc, recent experiments
show
>> that the performance with a dynamic shadow is close to that with a
static
>> mapping. In fact, that is the only approach currently used by HWASAN.
Given
>> the simplicity, the heap profiler will be implemented with a dynamic
shadow
>> as well.
>>
>> There are a number of functions in ASAN and HWASAN related to setup of
>> the shadow that are duplicated but very nearly identical, at least for
>> linux (which seems to be the only OS flavor currently supported for
>> HWASAN). E.g. ReserveShadowMemoryRange, ProtectGap, and
>> FindDynamicShadowStart (in ASAN there is another nearly identical copy
in
>> PremapShadow, used by Android, whereas in HW ASAN the premap handling
is
>> already commoned with the non-premap handling). Rather than make yet
>> another copy of these mechanisms, I propose refactoring them into
>> sanitizer_common versions. Like HWASAN, the initial version of the heap
>> profiler will be supported for linux only, but other OSes can be added
as
>> needed similar to ASAN.
>> StackTrace and StackDepot
>>
>> The sanitizer already contains support for obtaining and representing a
>> stack trace in a StackTrace object, and storing it in the StackDepot
which
>> “efficiently stores huge amounts of stack traces”. This is in the
>> sanitizer_common subdirectory and the support is shared by ASAN and
>> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
>> where each StackTrace is assigned a unique id. ASAN stores this id in
the
>> alloc_context_id field in each ChunkHeader (in the redzone preceding
each
>> allocation). Additionally, there is support for symbolizing and
printing
>> StackTrace objects.
>> ChunkHeader
>>
>> The heap profiler needs to track several pieces of information for each
>> allocation. Given the mapping of 64-bytes to 8-bytes shadow, we can
achieve
>> a minimum of 32-byte alignment by holding this information in a 32-byte
>> header block preceding each allocation.
>>
>> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It
>> contains information about the current allocation state, user requested
>> size, allocation and free thread ids, the allocation context id
>> (representing the call stack at allocation, assigned by the StackDepot
as
>> described above), and misc other bookkeeping. For heap profiling, this
will
>> be converted to a 32-byte header block.
>>
>> Note that we could instead use the metadata section, similar to other
>> sanitizers, which is stored in a separate location. However, as
described
>> above, storing the header block with each allocation enables 32-byte
>> alignment without aliasing shadow counters for the same 64 bytes of
memory.
>>
>> In the prototype heap profiler implementation, the header contains the
>> following fields:
>>
>> // Should be 32 bytes
>>
>> struct ChunkHeader {
>>
>>   // 1-st 4 bytes
>>
>>   // Carry over from ASAN (available, allocated, quarantined). Will be
>>
>>   // reduced to 1 bit (available or allocated).
>>
>>   u32 chunk_state       : 8;
>>
>>   // Carry over from ASAN. Used to determine the start of user
allocation.
>>
>>   u32 from_memalign     : 1;
>>
>>   // 23 bits available
>>
>>   // 2-nd 4 bytes
>>
>>   // Carry over from ASAN (comment copied verbatim).
>>
>>   // This field is used for small sizes. For large sizes it is equal to
>>
>>   // SizeClassMap::kMaxSize and the actual size is stored in the
>>
>>   // SecondaryAllocator's metadata.
>>
>>   u32 user_requested_size : 29;
>>
>>   // 3-rd 4 bytes
>>
>>   u32 cpu_id; // Allocation cpu id
>>
>>   // 4-th 4 bytes
>>
>>   // Allocation timestamp in ms from a baseline timestamp computed at
>>
>>   // the start of profiling (to keep this within 32 bits).
>>
>>   u32 timestamp_ms;
>>
>>   // 5-th and 6-th 4 bytes
>>
>>   // Carry over from ASAN. Used to identify allocation stack trace.
>>
>>   u64 alloc_context_id;
>>
>>   // 7-th and 8-th 4 bytes
>>
>>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>>
>>   u64 data_type_id; // hash of type name
>>
>> };
>>
>> As noted, the chunk state can be reduced to a single bit (no need for
>> quarantined memory in the heap profiler). The header contains a
placeholder
>> for the data type hash, which is not yet implemented as it needs
>> instrumentation and IR support.
>> Heap Info Block (HIB)
>>
>> On a deallocation, information from the corresponding shadow block(s)
and
>> header are recorded in a Heap Info Block (HIB) object. The access count
is
>> computed from the shadow memory locations for the allocation, as well
as
>> the percentage of accessed 64-byte blocks (i.e. the percentage of
non-zero
>> 8-byte shadow locations for the whole allocation). Other information
such
>> as the deallocation timestamp (for lifetime computation) and
deallocation
>> cpu id (to determine migrations) are recorded along with the
information in
>> the chunk header recorded on allocation.
>>
>> The prototyped HIB object tracks the following:
>>
>> struct HeapInfoBlock {
>>
>>   // Total allocations at this stack context
>>
>>   u32 alloc_count;
>>
>>   // Access count computed from all allocated 64-byte blocks (track
total
>>
>>   // across all allocations, and the min and max).
>>
>>   u64 total_access_count, min_access_count, max_access_count;
>>
>>   // Allocated size (track total across all allocations, and the min
and
>> max).
>>
>>   u64 total_size;
>>
>>   u32 min_size, max_size;
>>
>>   // Lifetime (track total across all allocations, and the min and
max).
>>
>>   u64 total_lifetime;
>>
>>   u32 min_lifetime, max_lifetime;
>>
>>   // Percent utilization of allocated 64-byte blocks (track total
>>
>>   // across all allocations, and the min and max). The utilization is
>>
>>   // defined as the percentage of 8-byte shadow counters corresponding
to
>>
>>   // the full allocation that are non-zero.
>>
>>   u64 total_percent_utilized;
>>
>>   u32 min_percent_utilized, max_percent_utilized;
>>
>>   // Allocation and deallocation timestamps from the most recent merge
>> into
>>
>>   // the table with this stack context.
>>
>>   u32 alloc_timestamp, dealloc_timestamp;
>>
>>   // Allocation and deallocation cpu ids from the most recent merge
into
>>
>>   // the table with this stack context.
>>
>>   u32 alloc_cpu_id, dealloc_cpu_id;
>>
>>   // Count of allocations at this stack context that had a different
>>
>>   // allocation and deallocation cpu id.
>>
>>   u32 num_migrated_cpu;
>>
>>   // Number of times the lifetime of the entry being merged had its
>> lifetime
>>
>>   // overlap with the previous entry merged with this stack context (by
>>
>>   // comparing the new alloc/dealloc timestamp with the one last
recorded
>> in
>>
>>   // the entry in the table.
>>
>>   u32 num_lifetime_overlaps;
>>
>>   // Number of times the alloc/dealloc cpu of the entry being merged
was
>> the
>>
>>   // same as that of the previous entry merged with this stack context
>>
>>   u32 num_same_alloc_cpu;
>>
>>   u32 num_same_dealloc_cpu;
>>
>>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation
>> support and
>>
>>   // possibly IR changes.
>>
>>   u64 data_type_id;
>>
>> }
>> HIB Table
>>
>> The Heap Info Block Table, which is a multi-way associative cache,
holds
>> HIB objects from deallocated objects. It is indexed by the stack
allocation
>> context id from the chunk header, and currently utilizes a simple mod
with
>> a prime number close to a power of two as the hash (because of the way
the
>> stack context ids are assigned, a mod of a power of two performs very
>> poorly). Thus far, only 4-way associativity has been evaluated.
>>
>> HIB entries are added or merged into the HIB Table on each
deallocation.
>> If an entry with a matching stack alloc context id is found in the
Table,
>> the newly deallocated information is merged into the existing entry.
Each
>> HIB Table entry currently tracks the min, max and total value of the
>> various fields for use in computing and reporting the min, max and
average
>> when the Table is ultimately dumped.
>>
>> If no entry with a matching stack alloc context id is found, a new
entry
>> is created. If this causes an eviction, the evicted entry is dumped
>> immediately (by default to stderr, otherwise to a specified report
file).
>> Later post processing can merge dumped entries with the same stack
alloc
>> context id.
>> Initialization
>>
>> For ASAN, an __asan_init function initializes the memory allocation
>> tracking support, and the ASAN instrumentation pass in LLVM creates a
>> global constructor to invoke it. The heap profiler prototype adds a new
>> __heapprof_init function, which performs heap profile specific
>> initialization, and the heap profile instrumentation pass calls this
new
>> init function instead by a generated global constructor. It currently
>> additionally invokes __asan_init since we are leveraging a modified
ASAN
>> runtime. Eventually, this should be changed to initialize refactored
common
>> support.
>>
>> Note that __asan init is also placed in the .preinit_array when it is
>> available, so it is invoked even earlier than global constructors.
>> Currently, it is not possible to do this for __heapprof_init, as it
calls
>> timespec_get in order to get a baseline timestamp (as described in the
>> ChunkHeader comments the timestamps (ms) are actually offsets from the
>> baseline timestamp, in order to fit into 32 bits), and system calls
cannot
>> be made that early (dl_init is not complete). Since the constructor
>> priority is 1, it should be executed early enough that there are very
few
>> allocations before it runs, and likely the best solution is to simply
>> ignore any allocations before initialization.
>> Dumping
>>
>> For the prototype, the profile is dumped as text with a compact raw
>> format to limit its size. Ultimately it should be dumped in a more
compact
>> binary format (i.e. into a different section of the raw instrumentation
>> based profile, with llvm-profdata performing post-processing) which is
TBD.
>> HIB Dumping
>>
>> As noted earlier, HIB Table entries are created as memory is
deallocated.
>> At the end of the run (or whenever dumping is requested, discussed
later),
>> HIB entries need to be created for allocations that are still live.
>> Conveniently, the sanitizer allocator already contains a mechanism to
walk
>> through all chunks of memory it is tracking (ForEachChunk). The heap
>> profiler simply looks for all chunks with a chunk state of allocated,
and
>> creates a HIB the same as would be done on deallocation, adding each to
the
>> table.
>>
>> A HIB Table mechanism for printing each entry is then invoked.
>>
>> By default, the dumping occurs:
>>
>>    -
>>
>>    on evictions
>>    -
>>
>>    full table at exit (when the static Allocator object is destructed)
>>
>>
>> For running in a load testing scenario, we will want to add a mechanism
>> to provoke finalization (merging currently live allocations) and
dumping of
>> the HIB Table before exit. This would be similar to the
__llvm_profile_dump
>> facility used for normal PGO counter dumping.
>> Stack Trace Dumping
>>
>> There is existing support for dumping symbolized StackTrace objects. A
>> wrapper to dump all StackTrace objects in the StackDepot will be added.
>> This new interface is invoked just after the HIB Table is dumped (on
exit
>> or via dumping interface).
>> Memory Map Dumping
>>
>> In cases where we may want to symbolize as a post processing step, we
may
>> need the memory map (from /proc/self/smaps). Specifically, this is
needed
>> to symbolize binaries using ASLR (Address Space Layout Randomization).
>> There is already support for reading this file and dumping it to the
>> specified report output file (DumpProcessMap()). This is invoked when
the
>> profile output file is initialized (HIB Table construction), so that
the
>> memory map is available at the top of the raw profile.
>> Current Status and Next Steps
>>
>> As mentioned earlier, I have a working prototype based on a simplified
>> stripped down version of ASAN. My current plan is to do the following:
>>
>>    1.
>>
>>    Refactor out some of the shadow setup code common between ASAN and
>>    HWASAN into sanitizer_common.
>>    2.
>>
>>    Rework my prototype into a separate heapprof library in compiler-rt,
>>    using sanitizer_common support where possible, and send patches for
review.
>>    3.
>>
>>    Send patches for the heap profiler instrumentation pass and related
>>    clang options.
>>    4.
>>
>>    Design/implement binary profile format
>>
>>
>> --
>> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>>
>
-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200709/1184a56c/attachment-0001.html>

llvm dev - Jul 2020 - RFC: Sanitizer-based Heap Profiler

[llvm-dev] RFC: Sanitizer-based Heap Profiler

[llvm-dev] RFC: Sanitizer-based Heap Profiler

[llvm-dev] RFC: Sanitizer-based Heap Profiler

[llvm-dev] RFC: Sanitizer-based Heap Profiler

[llvm-dev] RFC: Sanitizer-based Heap Profiler

[llvm-dev] RFC: Sanitizer-based Heap Profiler

[llvm-dev] RFC: Sanitizer-based Heap Profiler

[llvm-dev] RFC: Sanitizer-based Heap Profiler