thr3ads.net - llvm dev - [llvm-dev] RFC: Sanitizer-based Heap Profiler [Jul 2021]


Andrey Bokhanko via llvm-dev

2021-Jul-08 15:03 UTC

[llvm-dev] RFC: Sanitizer-based Heap Profiler

Hi Teresa,

One more thing, if you don't mind.

On Tue, Jul 6, 2021 at 12:54 AM Teresa Johnson <tejohnson at google.com>
wrote:
> We initially plan to use the profile information to provide guidance to
> the dynamic allocation runtime on data allocation and placement. We'll
> send more details on that when it is fleshed out too.
>
I played with the current implementation and became a bit concerned about
whether the current data profile is sufficient for an efficient data
allocation optimization.

First, there is no information on temporal locality -- only the
total_lifetime of an allocation block is recorded, not its start / end
times -- let alone timestamps of actual memory accesses. I wonder what
criteria a data-profile-based allocation runtime would use to decide to
allocate two blocks from the same memory chunk?

Second, according to the data from [Savage'20], memory access affinity
(the spatial distance between temporally close memory accesses to two
different allocated blocks) is crucial: figure #12 demonstrates that this is
vital for the omnetpp benchmark from SPEC CPU 2017.

That said, my concerns are based essentially on a single paper that employs
specific algorithms to guide memory allocation and measures their impact on
a specific set of benchmarks. I wonder if you have preliminary data that
validates the sufficiency of the implemented data profile for efficient
optimization of heap memory allocations?

References:
[Savage'20] Savage, J., & Jones, T. M. (2020). HALO: Post-Link Heap-Layout
Optimisation. CGO 2020: Proceedings of the 18th ACM/IEEE International
Symposium on Code Generation and Optimization,
https://doi.org/10.1145/3368826.3377914

Yours,
Andrey

Teresa Johnson via llvm-dev

2021-Jul-08 15:58 UTC

Hi Andrey,

I was actually just typing up a reply welcoming contributions and to
suggest you give the existing profile support a try - I realized I need to
add documentation for the usage to llvm/clang's docs which I will do soon
but it sounds like you figured it out ok.

Some answers below.

On Thu, Jul 8, 2021 at 8:03 AM Andrey Bokhanko <andreybokhanko at gmail.com>
wrote:
> Hi Teresa,
>
> One more thing, if you don't mind.
>
> On Tue, Jul 6, 2021 at 12:54 AM Teresa Johnson <tejohnson at google.com>
> wrote:
>
>> We initially plan to use the profile information to provide guidance to
>> the dynamic allocation runtime on data allocation and placement. We'll
>> send more details on that when it is fleshed out too.
>>
>
> I played with the current implementation and became a bit concerned about
> whether the current data profile is sufficient for an efficient data
> allocation optimization.
>
> First, there is no information on temporal locality -- only the
> total_lifetime of an allocation block is recorded, not its start / end
> times -- let alone timestamps of actual memory accesses. I wonder what
> criteria a data-profile-based allocation runtime would use to decide to
> allocate two blocks from the same memory chunk?
>
It would be difficult to add all of this information for every allocation,
and particularly for every access, without being prohibitively expensive.
Right now we have the ave/min/max lifetime, and just a single boolean per
context indicating whether there was a lifetime overlap with the prior
allocation for that context. We can probably expand this a bit to have
somewhat richer aggregate information, but like I said, recording and
emitting all start/end times and timestamps would produce an overwhelming
amount of information. As I mentioned in my other response, initially the
goal is to provide hints about hotness and lifetime length (short vs long)
to the memory allocator so that it can make smarter decisions about how and
where to allocate data.
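
To make the shape of that aggregate concrete, here is a minimal C++ sketch of
a per-context record holding ave/min/max lifetime plus the single overlap
boolean. The struct name, field names, and the assumption that blocks are
recorded in allocation order are all hypothetical and are not taken from the
actual profile format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical per-allocation-context aggregate. Names are illustrative
// only and do not mirror the real profile record layout.
struct ContextLifetimeStats {
  uint64_t alloc_count = 0;
  uint64_t total_lifetime = 0;  // average is derived as total / count
  uint64_t min_lifetime = std::numeric_limits<uint64_t>::max();
  uint64_t max_lifetime = 0;
  uint64_t last_dealloc_time = 0;
  // True once any block from this context was allocated before the
  // previously recorded block from the same context had been freed.
  bool overlapped_prior = false;

  // Record one block. Assumption: blocks arrive in allocation order.
  void Record(uint64_t alloc_time, uint64_t dealloc_time) {
    assert(dealloc_time >= alloc_time);
    const uint64_t lifetime = dealloc_time - alloc_time;
    if (alloc_count > 0 && alloc_time < last_dealloc_time)
      overlapped_prior = true;
    ++alloc_count;
    total_lifetime += lifetime;
    min_lifetime = std::min(min_lifetime, lifetime);
    max_lifetime = std::max(max_lifetime, lifetime);
    last_dealloc_time = dealloc_time;
  }

  uint64_t AverageLifetime() const {
    return alloc_count == 0 ? 0 : total_lifetime / alloc_count;
  }
};
```

The record stays the same size no matter how many blocks a context
allocates, which is the trade-off described above: cheap to keep, but
individual start and end times are not recoverable from it.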

>
> Second, according to the data from [Savage'20], memory access affinity
> (the spatial distance between temporally close memory accesses to two
> different allocated blocks) is crucial: figure #12 demonstrates that this
> is vital for the omnetpp benchmark from SPEC CPU 2017.
>
Right now we don't track this information. Part of the issue is that memory
accesses themselves don't interact with the profile runtime library; rather,
the code is instrumented to update shadow counters inline, which keeps the
overhead reasonable. My understanding from reading the HALO paper and asking
the authors at CGO is that the overheads are currently quite large (both the
PIN-based runtime and the offline grouping algorithm), and that it doesn't
support multithreaded applications yet.

Definitely interested in contributions or ideas on how we could collect
richer information with the approach we're taking (allocations tracked by
the runtime per context and fast shadow memory based updates for accesses).
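
For readers unfamiliar with the shadow-counter approach, the following is a
toy C++ sketch of the idea: each access bumps a counter found by simple
address arithmetic, and per-block totals are summed later. The class name,
the 64-byte granularity, and the vector-backed shadow are assumptions made
for illustration; the real runtime maps shadow memory differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy shadow memory: one 64-bit counter per kGranularity bytes of a
// tracked address range. Hypothetical model, not the runtime's layout.
class ShadowCounters {
 public:
  static constexpr uintptr_t kGranularity = 64;  // assumed granule size

  ShadowCounters(uintptr_t base, size_t size)
      : base_(base),
        size_(size),
        counters_((size + kGranularity - 1) / kGranularity, 0) {}

  // What inline instrumentation would do before a load or store:
  // compute the shadow slot from the address and increment it.
  void RecordAccess(uintptr_t addr) {
    assert(addr >= base_ && addr - base_ < size_);
    ++counters_[(addr - base_) / kGranularity];
  }

  // What the runtime could do when a block is freed: sum the counters
  // of every granule the block touches.
  uint64_t AccessCount(uintptr_t block, size_t block_size) const {
    assert(block_size > 0 && block >= base_);
    assert(block - base_ + block_size <= size_);
    const size_t first = (block - base_) / kGranularity;
    const size_t last = (block - base_ + block_size - 1) / kGranularity;
    uint64_t total = 0;
    for (size_t i = first; i <= last; ++i) total += counters_[i];
    return total;
  }

 private:
  uintptr_t base_;
  size_t size_;
  std::vector<uint64_t> counters_;
};
```

Note that the access path never calls into a runtime function and stores no
timestamp, which is why this style of profiling is cheap and also why
access ordering between blocks is not available from it.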

>
> That said, my concerns are based essentially on a single paper that
> employs specific algorithms to guide memory allocation and measures their
> impact on a specific set of benchmarks. I wonder if you have preliminary
> data that validates the sufficiency of the implemented data profile for
> efficient optimization of heap memory allocations?
>
I don't have anything I can share yet but we will do so in the future. For
an idea of how lifetime based allocation would work, here's a related paper
which used ML to identify context-sensitive lifetimes and used the info in
a custom allocator:

https://research.google/pubs/pub49008/
Maas, Martin & Andersen, David & Isard, Michael & Javanmard, Mohammad Mahdi
& McKinley, Kathryn & Raffel, Colin. (2020). Learning-based Memory
Allocation for C++ Server Workloads. Proceedings of the 25th ACM
International Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS). 541-556. 10.1145/3373376.3378525.

Teresa

-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |

Xinliang David Li via llvm-dev

2021-Jul-08 16:54 UTC

On Thu, Jul 8, 2021 at 8:03 AM Andrey Bokhanko <andreybokhanko at gmail.com>
wrote:
> Hi Teresa,
>
> One more thing, if you don't mind.
>
> On Tue, Jul 6, 2021 at 12:54 AM Teresa Johnson <tejohnson at google.com>
> wrote:
>
>> We initially plan to use the profile information to provide guidance to
>> the dynamic allocation runtime on data allocation and placement. We'll
>> send more details on that when it is fleshed out too.
>>
>
> I played with the current implementation and became a bit concerned about
> whether the current data profile is sufficient for an efficient data
> allocation optimization.
>
> First, there is no information on temporal locality -- only the
> total_lifetime of an allocation block is recorded, not its start / end
> times -- let alone timestamps of actual memory accesses. I wonder what
> criteria a data-profile-based allocation runtime would use to decide to
> allocate two blocks from the same memory chunk?
>
First, I think per-allocation start-end time should be added to approximate
temporal locality.

Detailed temporal locality information is not tracked by design, for a
variety of reasons:

1.  This can be done with static analysis. The idea is for the compiler to
instrument a potentially hot access region and profile the start and end
addresses of the accessed memory regions. This information can be combined
with the regular heap profile data. In the profile-use phase, the compiler
can perform access pattern analysis and produce an affinity graph.

2.  We try to make use of the existing allocator runtime (tcmalloc) for
locality optimization. The runtime has been tuned for years to have the
most efficient code for fast-path allocation.  For hot allocation sites,
adding too much overhead (e.g. via a wrapper) can totally eat up the gains
from the locality optimization.

3. tcmalloc currently uses size class based partitioning, which makes
co-allocation of small objects of different size classes impossible. Even
for objects with the same type/size, due to the use of free lists, there is
no guarantee that consecutively allocated objects are placed together.

4. A bump-pointer allocator has its own set of problems -- when not used
carefully, it can lead to huge memory waste due to fragmentation.  In
reality it only helps grouping for the initial set of allocations, while the
pointer bumps continuously -- during the stable state, the allocations will
also be all over the place and no contiguity can be guaranteed.
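
Point 3 above can be illustrated with a toy model. The C++ sketch below is a
deliberately simplified size-class allocator with per-class free lists; it
hands out integer "addresses" and manages no real memory. The class name,
the four size classes, and the slab layout are invented for illustration and
do not describe tcmalloc's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: each size class owns one slab of the address space and
// a LIFO free list. All constants here are hypothetical.
class ToySizeClassAllocator {
 public:
  static constexpr size_t kNumClasses = 4;      // 16, 32, 64, 128 bytes
  static constexpr size_t kSlabBytes = 4096;    // one slab per class
  static constexpr uintptr_t kBase = 0x10000;   // start of the toy heap

  // Smallest class whose capacity fits the request.
  static size_t ClassFor(size_t bytes) {
    assert(bytes > 0 && bytes <= 128);
    size_t cls = 0;
    for (size_t cap = 16; cap < bytes; cap *= 2) ++cls;
    return cls;
  }

  uintptr_t Allocate(size_t bytes) {
    const size_t cls = ClassFor(bytes);
    std::vector<uintptr_t>& free_list = free_lists_[cls];
    if (!free_list.empty()) {  // reuse the most recently freed slot first
      const uintptr_t addr = free_list.back();
      free_list.pop_back();
      return addr;
    }
    const uintptr_t addr = kBase + cls * kSlabBytes + next_offset_[cls];
    next_offset_[cls] += size_t{16} << cls;  // advance by class capacity
    assert(next_offset_[cls] <= kSlabBytes);
    return addr;
  }

  void Free(uintptr_t addr) {
    free_lists_[(addr - kBase) / kSlabBytes].push_back(addr);
  }

 private:
  std::vector<uintptr_t> free_lists_[kNumClasses];
  size_t next_offset_[kNumClasses] = {0, 0, 0, 0};
};
```

In this model a 16-byte and a 100-byte object can never be neighbours because
they come from different slabs, and two back-to-back 16-byte allocations are
not adjacent either once the free list holds a recycled slot.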

This is why initially we focus on more coarse-grained locality optimization
-- 1) co-placement to improve DTLB performance and 2) improving dcache
utilization, using only lifetime and hotness information.
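
As a rough sketch of how lifetime and hotness alone could be turned into
allocator hints, the C++ below classifies an allocation context as hot or
cold and short- or long-lived. Every name, the accesses-per-byte metric, and
the thresholds are hypothetical choices for illustration; the thread does not
specify how the hints are actually derived.

```cpp
#include <cassert>
#include <cstdint>

enum class Hotness { kCold, kHot };
enum class Lifetime { kShort, kLong };

struct AllocHint {
  Hotness hotness;
  Lifetime lifetime;
};

// Hypothetical aggregate profile for one allocation context.
struct ContextProfile {
  uint64_t alloc_count;
  uint64_t total_size;          // bytes allocated across all blocks
  uint64_t total_access_count;  // accesses summed across all blocks
  uint64_t total_lifetime;      // lifetimes summed across all blocks
};

// Hypothetical tuning knobs; real thresholds would need measurement.
struct HintThresholds {
  double hot_accesses_per_byte;
  uint64_t long_lifetime;
};

AllocHint ClassifyContext(const ContextProfile& p, const HintThresholds& t) {
  AllocHint hint{Hotness::kCold, Lifetime::kShort};
  if (p.alloc_count == 0 || p.total_size == 0) return hint;
  // Hotness: access density, so that large blocks are not hot by size alone.
  const double density = static_cast<double>(p.total_access_count) /
                         static_cast<double>(p.total_size);
  if (density >= t.hot_accesses_per_byte) hint.hotness = Hotness::kHot;
  // Lifetime: compare the average block lifetime against the threshold.
  if (p.total_lifetime / p.alloc_count >= t.long_lifetime)
    hint.lifetime = Lifetime::kLong;
  return hint;
}
```

An allocator receiving such hints could, for example, place hot short-lived
blocks on the same pages and keep cold long-lived blocks elsewhere, which
targets page-level (DTLB) locality without needing pairwise affinity data.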

Longer term, we need to beef up compiler-based analysis -- objects with
exactly the same lifetimes can be safely co-allocated via a compiler-based
transformation. Also, objects with similar lifetimes can be co-allocated
without introducing too much fragmentation.

Thanks,

David
