thr3ads.net - llvm dev - [llvm-dev] RFC: llvm support for trace profile driven cache prefetching insertion [Nov 2018]

Mircea Trofin via llvm-dev

2018-Nov-02 22:23 UTC

[llvm-dev] RFC: llvm support for trace profile driven cache prefetching insertion

This change is part of a larger system, consisting of a cache prefetch
recommender, create_llvm_prof <https://github.com/google/autofdo>, and
LLVM.

A proof of concept recommender is DynamoRIO's cache miss analyzer
<https://github.com/DynamoRIO/dynamorio/blob/master/clients/drcachesim/simulator/cache_miss_analyzer.cpp>.
It processes memory access traces obtained from a running binary and
identifies patterns in cache misses. Based on them, it produces a CSV file
with recommendations. The expectation is that, by leveraging such
recommendations, we can reduce the number of clock cycles spent waiting for
data from memory. A microbenchmark based on the DynamoRIO analyzer is
available as a proof of concept <https://goo.gl/6TM2Xp>.

The recommender makes prefetch recommendations in terms of:

- the binary offset of an instruction with a memory operand;
- a delta;
- and a type (nta, t0, t1, t2)

meaning: a prefetch of that type should be inserted right before the
instruction at that binary offset, and the prefetch should be for an
address delta away from the memory address the instruction will access.

For example:

0x400ab2,64,nta

and assuming the instruction at 0x400ab2 is:

movzbl (%rbx,%rdx,1),%edx

means that the recommender determined it would be beneficial for a
prefetchnta instruction to be inserted right before this instruction, as
follows:

prefetchnta 0x40(%rbx,%rdx,1)
movzbl (%rbx,%rdx,1),%edx
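
To make the mapping from a recommendation to the inserted instruction concrete, here is a minimal Python sketch. It is not part of the recommender, create_llvm_prof, or LLVM; the names Recommendation, parse_recommendation and render_prefetch are hypothetical, and the sketch assumes the delta field is a decimal byte count (as in the example above) and that the memory operand is given in AT&T syntax with an optional hex or decimal displacement.

```python
import re
from dataclasses import dataclass

# x86 prefetch mnemonics for the four hint types listed above.
MNEMONICS = {
    "nta": "prefetchnta",
    "t0": "prefetcht0",
    "t1": "prefetcht1",
    "t2": "prefetcht2",
}


@dataclass(frozen=True)
class Recommendation:  # hypothetical helper type, for illustration only
    offset: int  # binary offset of the instruction with a memory operand
    delta: int   # byte distance from the address that instruction accesses
    kind: str    # one of nta, t0, t1, t2


def parse_recommendation(line: str) -> Recommendation:
    """Parse one CSV line of the form '<hex offset>,<delta>,<type>'."""
    offset, delta, kind = (field.strip() for field in line.split(","))
    if kind not in MNEMONICS:
        raise ValueError(f"unknown prefetch type: {kind}")
    return Recommendation(int(offset, 16), int(delta), kind)


def render_prefetch(rec: Recommendation, mem_operand: str) -> str:
    """Return the prefetch to place right before the instruction.

    mem_operand is the AT&T-syntax memory operand of that instruction,
    e.g. '(%rbx,%rdx,1)' or '0x10(%rbx)'.
    """
    match = re.fullmatch(r"(-?(?:0x[0-9a-fA-F]+|\d+))?(\(.*\))", mem_operand)
    if match is None:
        raise ValueError(f"unsupported memory operand: {mem_operand}")
    displacement = int(match.group(1), 0) if match.group(1) else 0
    total = displacement + rec.delta  # prefetch address is delta bytes away
    if total == 0:
        disp_text = ""
    elif total < 0:
        disp_text = f"-{-total:#x}"
    else:
        disp_text = f"{total:#x}"
    return f"{MNEMONICS[rec.kind]} {disp_text}{match.group(2)}"


rec = parse_recommendation("0x400ab2,64,nta")
print(render_prefetch(rec, "(%rbx,%rdx,1)"))
```

For the example recommendation, this prints the prefetchnta line shown above. In the real system the operand comes from the compiler's view of the instruction, not from disassembly text.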

The workflow for inserting cache prefetches is as follows (the proof of
concept script details these steps as well):

1. build binary, making sure -gmlt -fdebug-info-for-profiling is passed.
The latter option will enable the X86DiscriminateMemOps pass, which ensures
instructions with memory operands are uniquely identifiable (this causes
~2% size increase in total binary size due to the additional debug
information).

2. collect memory traces, run analysis to obtain recommendations (see
above-referenced DynamoRIO-based analyzer demo as a proof of concept).

3. use create_llvm_prof to convert recommendations to reference insertion
locations in terms of debug info locations.

4. rebuild binary, using the exact same set of arguments used initially,
plus -mllvm -prefetch-hints-file=<file>, where <file> is the afdo file
obtained at step 3.

Note that if sample profiling feedback-driven optimization is also desired,
that happens before step 1 above. In this case, the sample profile afdo
file that was used to produce the binary at step 1 must also be included in
step 4.
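
As a rough illustration of how the two builds relate, the following Python sketch assembles the compiler argument lists for steps 1 and 4. The helper names, the clang++ driver, the -O2 level and the file names are assumptions made for the example. The sketch also assumes the sample profile is supplied with clang's -fprofile-sample-use=<file>, which the text above does not spell out. Steps 2 and 3 are omitted because their command lines depend on the recommender and on create_llvm_prof.

```python
from typing import List, Optional, Sequence


def initial_build_args(sources: Sequence[str],
                       sample_profile: Optional[str] = None) -> List[str]:
    """Step 1: build with the debug info needed to identify memory operands."""
    # clang++ and -O2 are assumptions; the RFC only requires the two debug flags.
    args = ["clang++", "-O2", "-gmlt", "-fdebug-info-for-profiling"]
    if sample_profile is not None:
        # Assumption: sample profile feedback is passed with this clang flag.
        args.append(f"-fprofile-sample-use={sample_profile}")
    return args + list(sources)


def prefetch_build_args(initial_args: Sequence[str],
                        hints_file: str) -> List[str]:
    """Step 4: exactly the step 1 arguments, plus the prefetch hints file."""
    return list(initial_args) + ["-mllvm", f"-prefetch-hints-file={hints_file}"]


step1 = initial_build_args(["main.cc"], sample_profile="sample.afdo")
step4 = prefetch_build_args(step1, "prefetch.afdo")
print(" ".join(step1))
print(" ".join(step4))
```

Deriving the step 4 arguments from the step 1 list, rather than writing them out twice, is one way to guarantee that the two builds differ only in the prefetch hints flag, and that a sample profile used in step 1 is carried into step 4.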

The data needed by the compiler in order to identify prefetch insertion
points is very similar to what is needed for sample profiles. For this
reason, and given that the overall approach (memory tracing-based cache
recommendation mechanisms) is under active development, we use the afdo
format as a syntax for capturing this information. We avoid confusing
semantics with sample profile afdo data by feeding the two types of
information to the compiler through separate files and compiler flags.
Should the approach prove successful, we can investigate improvements to
this encoding mechanism.

https://reviews.llvm.org/D54052