thr3ads.net - llvm dev - [llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data [Aug 2020]

If this information is useful, please help other people find it:
Share via:

Xinliang David Li via llvm-dev

2020-Aug-05 16:48 UTC

[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

On Tue, Aug 4, 2020 at 10:51 PM aditya kumar <hiraditya at gmail.com>
wrote:
> Glad to hear that there is an interest in a function splitting pass. There
> are advantages to splitting functions at different stages as you've
already
> noted.
>
Right -- with slightly different objectives. Machine Function Splitting
Pass's main focus is on performance improvement.
> - Having a target independent function splitting scales well to LTO,
> ThinLTO, supporting multiple architectures and offers ease of maintenance.
> - While HCS+merge-function helps significantly reduce the codesize, in
> many cases the outlined functions tend to have identical function bodies
> (e.g., assert-fail etc); they can be deduplicated by linker with careful
> function naming. This reduces code-size regardless of function merging and
> across the entire program. This technique should also help Machine Function
> splitter in some cases but at some cost to the link time.
>
yes -- I think this can also be achieved with the partial inlining pass.

- It will be difficult to reduce argument setup and restore code in the
HCS> except in some cases like tail call, internal function calls, non-returning
> function calls etc. Having frame setup however should help with debugging
> IMO.
>
> > Recent work at Google has shown that aggressive, profile-driven
> inlining for performance has led to significant code bloat and icache
> fragmentation
>
> Because the inliner can be too aggressive at times and can negatively
affect icache-miss etc, HCS can be integrated with inliner to assist in partial
inlining (For example: split the callee before inlining). Rodrigo (@rcorcs) has
some ideas around that and we've been exploring that as part of GSoC
project.
>
>The Inliner can also be hindered without splitting. Partial inlining can
help a little, but it can be limited because many of the outlining
opportunities are only exposed after inlining (in inline instances).
 Making inliner Machine Splitting or HCS aware is the way to go.   Machine
Function splitting has an advantage here as it does not need sophisticated
analysis to figure out what part of code can and can not be split out post
inlining.

> We've performance numbers from Firefox which shows ~5% performance
> improvement with HCS (cc: Ruijie @rjf). Vedant also reported performance
> numbers across iOS and Swift benchmarks in the past. I could find (
> https://github.com/apple/swift/pull/21016) which reported decent
> performance improvement in core ios Frameworks.
>
>Nice. If possible, the same performance tests can be done for Machine
Splitting once the patch is posted :)


> That said we have been working on improving the cost model which I think
> will help alleviate many of the limitations that we typically don't
have in
> a Machine Splitting optimization. I'd like to hear your ideas on how
the
> cost model can be improved.
>
>The partial inliner pass has introduced many cost analysis using profile
data. I think HCS should probably share those code (as utilities) as the
nature of the transformation is similar. Cost model can sometimes be quite
tricky though -- it is hard to compare the cost with the actual benefit
brought by the splitting.   The beauty of machine splitting is that it does
not depend on sophisticated cost/benefit model.


> > In contrast, the machine function splitter extracts cold code into a
separate
> section.
> HCS also adds a section prefix to all the cold functions. It is possible
> that the cold functions are still in the same section as the hot one
> depending on the linker. Ruijie has a patch to move all the cold functions
> to a separate section, we are still evaluating the results (
>
https://github.com/ruijiefang/llvm-hcs/commit/4966997e135050c99b4adc0aa971242af99ba832).
> In case it is not difficult to rerun the experiments, it'll help to see
the
> numbers with the llvm-trunk and this patch from @rjf.
>
> > Furthermore, this may not play well with optimization passes such as
> MachineOutliner.
> Can you share an example where HCS and Machine Outliner don't play well
> together?
>
One thing I can think of is that Machine Outliner is based on code pattern,
while HCS also looks at hotness. The inconsistency can lead to missing
opportunities in machine outliner ?
>
> > Synergistic optimizations are harder to reason about due to the pass
> timing. For example, inlining can be more aggressive if any cold code
introduced
> is trimmed.
> How does this regress workloads if we have profile information and cold
> portions of a callee is outlined? Does the inliner always regress workloads
> we are evaluating?
>
>Currently inliner only looks at static code size (as cost). We are working
on improving it.


> PS: One correction I'd like to make is HCS splits SEME regions (Thanks
to
> Vedant). In case some SEME aren't getting outlined, there are CFG
> transformations to make them friendly to HCS. I'd love to see such an
> example, as that'd motivate some of the future work.
>
>It is common to see multiple entry cold regions post inlining.  After block
layout, we may see long chains of cold blocks with many blocks in the chain
being jump targets (from hot regions).   If there are multiple blocks
exiting the cold region, we get multiple exits case.

thanks,

David




> Aditya Kumar
> Compiler Engineer
> https://bitsimplify.com
>
>
> On Tue, Aug 4, 2020 at 5:31 PM Snehasish Kumar <snehasishk at
google.com>
> wrote:
>
>> Greetings,
>>
>> We present “Machine Function Splitter”, a codegen optimization pass
which
>> splits functions into hot and cold parts. This pass leverages the basic
>> block sections feature recently introduced in LLVM from the Propeller
>> project. The pass targets functions with profile coverage, identifies
cold
>> blocks and moves them to a separate section. The linker groups all cold
>> blocks across functions together, decreasing fragmentation and
improving
>> icache and itlb utilization. Our experiments show >2% performance
>> improvement on clang bootstrap, ~1% improvement on Google workloads and
>> 1.6% mean performance improvement on SPEC IntRate 2017.
>> Motivation
>>
>> Recent work at Google has shown that aggressive, profile-driven
inlining
>> for performance has led to significant code bloat and icache
fragmentation (AsmDB
>> - Ayers et al ‘2019 <https://research.google/pubs/pub48320/>). We
find
>> that most functions 5 KiB or larger have inlined children more than 10
>> layers deep bringing in exponentially more code at each inline level,
not
>> all of which is necessarily hot. Generally, in roughly half of even the
>> hottest functions, more than 50% of the code bytes are never executed,
but
>> likely to be in the cache.
>>
>> Function splitting is a well known compiler transformation primarily
>> targeting improved code locality to improve performance. LLVM has a
>> middle-end, target agnostic hot cold splitting pass
>>
<https://llvm.org/devmtg/2019-10/slides/Kumar-HotColdSplitting.pdf> as
>> well as a partial inlining pass
>>
<https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/IPO/PartialInlining.cpp>
>> which performs similar transformations, as noted by the authors in a
>> recent email thread
>>
<https://lists.llvm.org/pipermail/llvm-dev/2020-June/142429.html>.
>> However, due to the timing of the respective passes as well as the code
>> extraction techniques employed, the overall gains on large, complex
>> applications leave headroom for improvement. By deferring function
>> splitting to the codegen phase we can maximize the opportunity to
remove
>> cold code as well as refine the code extraction technique. Furthermore,
by
>> performing function splitting very late, earlier passes can perform
more
>> aggressive optimizations.
>> Implementation
>>
>> We propose a new machine function splitting pass which leverages the
basic
>> block sections feature <https://reviews.llvm.org/D68063> to split
>> functions without the caveats of code extraction in the middle-end. The
>> pass uses profile information to identify cold basic blocks very late
in
>> LLVM CodeGen, after regalloc and all other machine passes have
executed.
>> This allows our implementation to be precise in its assessment of cold
>> regions while maximizing opportunity.
>>
>> Each function is split into two parts. The hot cluster includes the
>> function entry and all blocks which are not cold. All the cold blocks
are
>> grouped together as a Cold Section cluster
>>
<https://github.com/llvm/llvm-project/blob/5934df0c9abe94fc450fbcf0ceca21cf838840e9/llvm/include/llvm/CodeGen/MachineBasicBlock.h#L63>.
>> With basic block sections, the cold blocks are assigned appropriate
debug
>> and call frame information and emitted as part of the .text.unlikely
>> section. Unlike Propeller
>>
<https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html>,
>> which is presently the main user of the basic block sections feature,
this
>> pass does not require an additional round of profiling and uses
existing
>> instrumentation based FDO or CSFDO profile information.
>>
>> [image: Machine Function Splitter.png]
>>
>>
>> In the illustration above, the functions foo and bar contain a cold
block
>> each, index 5 and E respectively. We show a possible layout for these
>> functions which optimizes for fall throughs. Note that all the blocks
are
>> kept in a contiguous region described by the symbols foo and bar. Using
the
>> machine function splitter, the cold blocks (5 and E) are moved to a
>> separate section. These blocks can then be grouped along with other
cold
>> blocks (and functions) in a separate output section in the final
binary.
>> The key highlights of this approach are:
>>
>>    -
>>
>>    Profile driven, profile type agnostic approach.
>>    -
>>
>>    Cold basic blocks are split out using jumps.
>>    -
>>
>>    No additional instructions are added to the function for
>>    setup/teardown.
>>    -
>>
>>    Runs as the last step before emitting assembly, no
>>    analysis/optimizations are hindered.
>>
>>
>> Exceptions
>>
>> All eh pads are grouped together regardless of their coldness and are
>> part of the original function. There are outstanding issues with
splitting
>> eh pads if they reside in separate sections in the binary. This remains
as
>> part of future work.
>>
>> DebugInfo and CFI
>>
>> Debug information and CFI directives are updated and kept consistent by
>> the underlying basic block sections framework. Support added in the
>> following patches
>>
>>    -
>>
>>    DebugInfo (https://reviews.llvm.org/D78851)
>>    -
>>
>>    CFI (https://reviews.llvm.org/D79978).
>>
>>
>>
>> Distinction between Machine Function Splitter and Propeller
>>
>>
>> Full Propeller optimizations include function splitting and layout
>> optimizations, however it requires an additional round of profiling
using
>> perf on top of the peak (FDO/CSFDO + ThinLTO) binary. In this work we
>> experiment with applying function splitting using the instrumented
profile
>> in the build instead of adding an additional round of profiling.
>>
>> Link to Propeller RFC
>>
<https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html>
>>
>>
>> Split Binary Characteristics
>>
>> Binaries produced by the compiler with function splitting enabled
contain
>> additional symbols. A function which has been split into a hot and cold
>> part is non-contiguous. The symbol table entry for the hot part retains
the
>> symbol name of the original function with type FUNC. The symbol for the
>> cold part contains a “.cold” suffix attached to the original symbol
name,
>> the type is not set for this symbol. Using a suffix has been the norm
for
>> such optimizations e.g. -hot-cold-split in LLVM and the prior GCC
>> implementation detailed earlier. We expect standardized tooling to
handle
>> split functions appropriately, e.g demangling works as expected --
>>
>> $ c++filt _Z3foov.cold
>>
>> foo() [clone .cold]
>>
>> Contrast with HotColdSplit (HCS)
>>
>> Function splitting in the middle-end in LLVM employs extraction of cold
>> single-entry-single-exit (SESE) regions into separate functions. In
>> general, the pass has been found to be impactful in reducing code size
by
>> deduplication of cold regions; however our experiments show it does not
>> improve performance of large workloads.
>>
>> The key differences are:
>>
>> Extraction methodology and tradeoffs
>>
>> HCS extracts cold code from SESE regions using a function call. This
may
>> incur a spill and fill of caller registers along with additional setup
and
>> teardown if live values modified in the cold region need to be
communicated
>> back to the original function. This has a couple of implications
>>
>>    1.
>>
>>    The “residue” of each extracted region is non-trivial and there is a
>>    tradeoff between the amount of code that needs to be cold before it
is
>>    profitable to extract. Thus the cost of mischaracterization is high.
>>    2.
>>
>>    Since each SESE region is extracted separately the net reduction in
>>    code size of the original function is less.
>>
>>
>> In contrast, the machine function splitter extracts cold code into a
>> separate section. Control is transferred to cold code via jumps. More
often
>> than not these jumps may already exist as part of the original layout
thus
>> incurring no additional cost. No additional instructions are inserted
to
>> accommodate splitting. Finally, no additional setup/teardown is
necessary
>> for live values modified in cold regions.
>>
>> Pass timing and interaction with other optimizations
>>
>> The HCS pass is run on the IR in the optimizer. This allows it to be
>> target agnostic and allow later stages to merge identical code if
>> necessary. However, there are some drawbacks to this approach. In
>> particular,
>>
>>    1.
>>
>>    Splitting early may miss opportunities introduced by later passes
>>    such as library call inlining and CFG simplification resulting from
a
>>    combination of optimizations. Furthermore, this may not play well
with
>>    optimization passes such as MachineOutliner.
>>    2.
>>
>>    Synergistic optimizations are harder to reason about due to the pass
>>    timing. For example, inlining can be more aggressive if any cold
code
>>    introduced is trimmed.
>>
>>
>> In contrast, the machine function splitter runs as the last step in
>> codegen. This ensures that the opportunity for splitting is maximised
>> without hindering existing analyses and synergistic decisions can be
made
>> in earlier optimization passes. We rely on accurate profile count
>> propagation across optimizations to maximise opportunities. This works
>> particularly well for instrumented profiles while improving the pass
for
>> sampled profiles is ongoing work.
>>
>> We have provided a contrived example in the Appendix which demonstrates
>> the code generated for both approaches. The key differences are
highlighted
>> inline.
>>
>> Evaluation
>>
>> In this section, we present an in-depth evaluation of the impact on
clang
>> bootstrap and summary results for two google internal workloads,
Search1
>> and Search2 as well overall results on the SPECInt 2017 benchmarks. All
>> experiments are conducted on Intel Skylake based systems unless
otherwise
>> noted. Profile guided optimizations using instrumented profiles are
enabled
>> for all builds.
>>
>> clang-bootstrap
>>
>> We pick 500 compiler invocations from a bootstrap build of clang and
then
>> evaluate the performance of a PGO+ThinLTO optimized version with that
of
>> PGO+ThinLTO+Split compiler. For the latter, the final optimized build
>> includes the machine function splitter.
>>
>> Results:
>>
>> We observe a mean 2.33% improvement in end to end runtime. The
>> improvements in runtime are driven by reduction in icache and TLB miss
>> rates. The table below summarizes our experiment, each data point is
>> averaged over multiple iterations. The observed variation for each
metric
>> is < 1%.
>>
>> Event
>>
>> Split (MPKI)
>>
>> Baseline (MPKI)
>>
>> % Reduction
>>
>> itlb_miss
>>
>> 0.87
>>
>> 1.28
>>
>> 31.70
>>
>> stlb_miss
>>
>> 0.08
>>
>> 0.12
>>
>> 32.51
>>
>> l1i_miss
>>
>> 5.98
>>
>> 6.61
>>
>> 9.56
>>
>> l2_miss
>>
>> 0.27
>>
>> 0.34
>>
>> 20.02
>>
>> In this experiment, the function splitting pass moved cold code from
~30K
>> functions in .text and .text.hot. We present a comparison of the binary
>> contents using bloaty <https://github.com/google/bloaty>
>>
>>
>>     FILE SIZE        VM SIZE
>>
>>  --------------  --------------
>>
>>    +23% +8.26Mi   +23% +8.26Mi    .text.unlikely
>>
>>   +6.5%  +761Ki  [ = ]       0    .strtab
>>
>>   +4.8%  +247Ki  +4.8%  +247Ki    .eh_frame
>>
>>   +6.1%  +193Ki  [ = ]       0    .symtab
>>
>>   +8.5% +63.1Ki  +8.5% +63.1Ki    .eh_frame_hdr
>>
>>   +0.3% +31.3Ki  +0.3% +31.3Ki    .rodata
>>
>>   +0.4%      +3  [ = ]       0    [Unmapped]
>>
>>   -0.3%      -8  -0.3%      -8    .init_array
>>
>>   [ = ]       0 -33.3%      -8    [LOAD #4 [RW]]
>>
>>   [ = ]       0  -0.2%    -416    .bss
>>
>>  -57.1% -4.04Mi -57.1% -4.04Mi    .text.hot
>>
>>  -48.4% -4.13Mi -48.4% -4.13Mi    .text
>>
>>   +1.6% +1.35Mi  +0.6%  +430Ki    TOTAL
>>
>> We see that 48% and 57% of code in .text and .text.hot respectively was
>> moved to the .text.unlikely section. We also note a small increase in
>> overall binary size due to the following reasons:
>>
>>    -
>>
>>    Some additional jump instructions may be inserted.
>>    -
>>
>>    Small increase in associated metadata, e.g. debug information.
>>    -
>>
>>    Additional symbols of type foo.cold for cold parts.
>>    -
>>
>>    Alignment requirements for both original and split function parts.
>>
>>
>> Comparison with HotColdSplit
>>
>> For the clang-bootstrap benchmark we also compared the performance of
the
>> hot-cold-split pass with split-machine-functions. We summarize the
results
>> for performance and the characteristics of the binary built by each
pass in
>> the table below. Each metric is presented as change vs the baseline, an
FDO
>> optimized build of clang.
>>
>>
>> Hot Cold Split
>>
>> Machine Function Splitter
>>
>> Performance
>>
>> 1.10%
>>
>> 2.65%
>>
>> .text size
>>
>> -41.5% -2.89Mi
>>
>> -49.2% -3.43Mi
>>
>> .text.hot size
>>
>> -46.9% -2.52Mi
>>
>> -57.1% -3.07Mi
>>
>> Full binary size
>>
>> 9.6% +7.56Mi
>>
>> 1.7% +1.37Mi
>>
>> Note that the increase in overall binary size increase for HCS is due
to
>> the increase in .eh_frame (+61% +3.03Mi). HCS extracts each cold SESE
>> region as a separate function whereas the machine function splitter
>> extracts the cold code as a single region thus incurring a constant
>> overhead per function.
>>
>> Google workloads
>>
>> We evaluated the impact of function splitting on a couple of search
>> workloads, Search1 and Search2. A key difference with respect to the
clang
>> experiment above is the use of huge pages for code. Overall, we find
that
>> on Intel Skylake the key benefit is from reduction of iTLB misses
whereas
>> on AMD the key benefit is from the reduction of icache misses. This is
due
>> to the fewer iTLB entries available for hugepages on Intel
architectures.
>> We find that overall throughput for Search1 and Search2 improve between
>> 0.8% to 1.2%; a significant improvement on these benchmarks. The
workloads
>> are built with FDO and CSFDO respectively. On Intel Skylake, iTLB
misses
>> reduce by 16% to 35%, sTLB misses reduce by 62% to 67%. On AMD, L1
icache
>> misses improve by 1.2% to 2.6% whereas L2 instruction misses improve by
>> 4.8% to 5.1%.
>>
>> Comparison with HotColdSplit
>>
>> An evaluation of the hot-cold-split pass did not yield performance
>> improvements on google workloads.
>>
>> SPECInt 2017
>>
>> We evaluated the impact of the machine function splitter on SPECInt
2017
>> using the int rate metrics. Overall, we found a 1.6% geomean intrate
>> improvement for the benchmarks where performance improved
(500.perlbench_r,
>> 502.gcc_r, 505.mcf_r, 520.omnetpp_r). For the benchmarks that didn’t
>> improve performance, the average degradation was 0.6% (523.xalancbmk_r,
>> 525.x264_r, 531.deepsjeng_r, 541.leela_r).
>>
>> We note that the instruction footprint of SPEC workloads are smaller
than
>> most modern workloads and our work is primarily focused on reducing the
>> footprint to improve performance. These experiments were performed on
Intel
>> Haswell machines.
>>
>> Appendix
>>
>> Example to illustrate hot-cold-split and split-machine-functions
>>
>> Input IR
>>
>> ```
>>
>> @i = external global i32, align 4
>>
>> define i32 @foo(i32 %0, i32 %1) nounwind !prof !1 {
>>
>>   %3 = icmp eq i32 %0, 0
>>
>>   br i1 %3, label %6, label %4, !prof !2
>>
>> 4:                                                ; preds = %2
>>
>>   %5 =  call i32 @L1()
>>
>>   br label %9
>>
>> 6:                                                ; preds = %2
>>
>>   %7 = call i32 @R1()
>>
>>   %8 = add nsw i32 %1, 1
>>
>>   br label %9
>>
>> 9:                                               ; preds = %6, %4
>>
>>   %10 = phi i32 [ %1, %4 ], [ %8, %6 ]
>>
>>   %11 = load i32, i32* @i, align 4
>>
>>   %12 = add nsw i32 %10, %11
>>
>>   store i32 %12, i32* @i, align 4
>>
>>   ret i32 %12
>>
>> }
>>
>> declare i32 @L1()
>>
>> declare i32 @R1() cold nounwind
>>
>> !1 = !{!"function_entry_count", i64 7}
>>
>> !2 = !{!"branch_weights", i32 0, i32 7}
>>
>> ```
>>
>> Code generated by Machine Function Splitter
>>
>> $ llc < example.ll -mtriple=x86_64-unknown-linux-gnu
>> -split-machine-functions
>>
>> ```
>>
>>         .text
>>
>>         .file   "<stdin>"
>>
>>         .globl  foo                             # -- Begin function foo
>>
>>         .p2align        4, 0x90
>>
>>         .type   foo, at function
>>
>> foo:                                    # @foo
>>
>> # %bb.0:
>>
>>         pushq   %rbx
>>
>>         movl    %esi, %ebx
>>
>>         testl   %edi, %edi
>>
>>         je      foo.cold                # Jump to cold code
>>
>> # %bb.1:
>>
>>         callq   L1
>>
>> .LBB0_2:
>>
>>         addl    i(%rip), %ebx
>>
>>         movl    %ebx, i(%rip)
>>
>>         movl    %ebx, %eax
>>
>>         popq    %rbx
>>
>>         retq
>>
>>         .section        .text.unlikely.foo,"ax", at progbits
>>
>> foo.cold:
>>
>>         callq   R1
>>
>>         incl    %ebx                    # Directly increment value
>>
>>         jmp     .LBB0_2
>>
>> .LBB_END0_3:
>>
>>         .size   foo.cold, .LBB_END0_3-foo.cold
>>
>>         .text
>>
>> .Lfunc_end0:
>>
>>         .size   foo, .Lfunc_end0-foo
>>
>>                                         # -- End function
>>
>>         .section        ".note.GNU-stack","", at
progbits
>>
>> ```
>>
>> Code generated by Hot Cold Split
>>
>> $ clang -c -O2 -S -mllvm --hot-cold-split -mllvm
>> --hotcoldsplit-threshold=0 -x ir example.ll
>>
>> ```
>>
>>         .text
>>
>>         .file   "example.ll"
>>
>>         .globl  foo                             # -- Begin function foo
>>
>>         .p2align        4, 0x90
>>
>>         .type   foo, at function
>>
>> foo:                                    # @foo
>>
>> # %bb.0:
>>
>>         pushq   %rbx
>>
>>         subq    $16, %rsp
>>
>>         movl    %esi, %ebx
>>
>>         testl   %edi, %edi
>>
>>         jne     .LBB0_1
>>
>> # %bb.2:                                # Residue block in original
>> function
>>
>>         leaq    12(%rsp), %rsi
>>
>>         movl    %ebx, %edi              # Pass param to increment
>>
>>         callq   foo.cold.1              # Call to cold code
>>
>>         movl    12(%rsp), %ebx          # Fill incremented value from
>> stack
>>
>> .LBB0_3:
>>
>>         addl    i(%rip), %ebx
>>
>>         movl    %ebx, i(%rip)
>>
>>         movl    %ebx, %eax
>>
>>         addq    $16, %rsp
>>
>>         popq    %rbx
>>
>>         retq
>>
>> .LBB0_1:
>>
>>         callq   L1
>>
>>         jmp     .LBB0_3
>>
>> .Lfunc_end0:
>>
>>         .size   foo, .Lfunc_end0-foo
>>
>>                                         # -- End function
>>
>>         .p2align        4, 0x90                         # -- Begin
>> function foo.cold.1
>>
>>         .type   foo.cold.1, at function
>>
>> foo.cold.1:                             # @foo.cold.1
>>
>> # %bb.0:                                # %newFuncRoot
>>
>>         pushq   %rbp
>>
>>         pushq   %rbx
>>
>>         pushq   %rax
>>
>>         movq    %rsi, %rbx
>>
>>         movl    %edi, %ebp
>>
>>         callq   R1
>>
>>         incl    %ebp
>>
>>         movl    %ebp, (%rbx)
>>
>>         addq    $8, %rsp
>>
>>         popq    %rbx
>>
>>         popq    %rbp
>>
>>         retq
>>
>> .Lfunc_end1:
>>
>>         .size   foo.cold.1, .Lfunc_end1-foo.cold.1
>>
>>                                         # -- End function
>>
>>         .cg_profile foo, L1, 0
>>
>>         .cg_profile foo, foo.cold.1, 7
>>
>>         .section        ".note.GNU-stack","", at
progbits
>>
>>         .addrsig
>>
>>         .addrsig_sym foo.cold.1
>>
>> ```
>>
>> Thanks,
>> Snehasish Kumar
>> Software Engineer, Google
>>
>>
>>
>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200805/29776d80/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Machine Function Splitter.png
Type: image/png
Size: 49967 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200805/29776d80/attachment-0001.png>

Snehasish Kumar via llvm-dev

2020-Aug-05 23:05 UTC

head link

[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

Thanks for the response Aditya! As David pointed out we are focused on
performance for a specific target and thus find that performing splitting
at a later stage yields better results.
I've also uploaded the diff for review at https://reviews.llvm.org/D85368.
Please feel free to comment on the patch as well.

On Wed, Aug 5, 2020 at 9:48 AM Xinliang David Li <davidxl at google.com>
wrote:
>
>
> On Tue, Aug 4, 2020 at 10:51 PM aditya kumar <hiraditya at gmail.com>
wrote:
>
>> Glad to hear that there is an interest in a function splitting pass.
>> There are advantages to splitting functions at different stages as
you've
>> already noted.
>>
>
> Right -- with slightly different objectives. Machine Function Splitting
> Pass's main focus is on performance improvement.
>
>> - Having a target independent function splitting scales well to LTO,
>> ThinLTO, supporting multiple architectures and offers ease of
maintenance.
>> - While HCS+merge-function helps significantly reduce the codesize, in
>> many cases the outlined functions tend to have identical function
bodies
>> (e.g., assert-fail etc); they can be deduplicated by linker with
careful
>> function naming. This reduces code-size regardless of function merging
and
>> across the entire program. This technique should also help Machine
Function
>> splitter in some cases but at some cost to the link time.
>>
>
> yes -- I think this can also be achieved with the partial inlining pass.
>
> - It will be difficult to reduce argument setup and restore code in the
>> HCS except in some cases like tail call, internal function calls,
>> non-returning function calls etc. Having frame setup however should
help
>> with debugging IMO.
>>
> I would like to point out that we do have frame setup for basic blocksections to ease debugging (https://reviews.llvm.org/D79978). Machine
function splitter leverages the basic block sections feature to implement
splitting.
>
>> > Recent work at Google has shown that aggressive, profile-driven
>> inlining for performance has led to significant code bloat and icache
>> fragmentation
>>
>> Because the inliner can be too aggressive at times and can negatively
affect icache-miss etc, HCS can be integrated with inliner to assist in partial
inlining (For example: split the callee before inlining). Rodrigo (@rcorcs) has
some ideas around that and we've been exploring that as part of GSoC
project.
>>
>>
> The Inliner can also be hindered without splitting. Partial inlining can
> help a little, but it can be limited because many of the outlining
> opportunities are only exposed after inlining (in inline instances).
>  Making inliner Machine Splitting or HCS aware is the way to go.   Machine
> Function splitting has an advantage here as it does not need sophisticated
> analysis to figure out what part of code can and can not be split out post
> inlining.
>
>
>> We've performance numbers from Firefox which shows ~5% performance
>> improvement with HCS (cc: Ruijie @rjf). Vedant also reported
performance
>> numbers across iOS and Swift benchmarks in the past. I could find (
>> https://github.com/apple/swift/pull/21016) which reported decent
>> performance improvement in core ios Frameworks.
>>
>>
> Nice. If possible, the same performance tests can be done for Machine
> Splitting once the patch is posted :)
>
>
>
>> That said we have been working on improving the cost model which I
think
>> will help alleviate many of the limitations that we typically don't
have in
>> a Machine Splitting optimization. I'd like to hear your ideas on
how the
>> cost model can be improved.
>>
>>
> The partial inliner pass has introduced many cost analysis using profile
> data. I think HCS should probably share those code (as utilities) as the
> nature of the transformation is similar. Cost model can sometimes be quite
> tricky though -- it is hard to compare the cost with the actual benefit
> brought by the splitting.   The beauty of machine splitting is that it does
> not depend on sophisticated cost/benefit model.
>The pass itself is fairly simple (see MachineFunctionSplitter.cpp,
https://reviews.llvm.org/D85368) and there is no cost benefit model as
David pointed out.
>
>
>
>> > In contrast, the machine function splitter extracts cold code into
a separate
>> section.
>> HCS also adds a section prefix to all the cold functions. It is
possible
>> that the cold functions are still in the same section as the hot one
>> depending on the linker. Ruijie has a patch to move all the cold
functions
>> to a separate section, we are still evaluating the results (
>>
https://github.com/ruijiefang/llvm-hcs/commit/4966997e135050c99b4adc0aa971242af99ba832).
>> In case it is not difficult to rerun the experiments, it'll help to
see the
>> numbers with the llvm-trunk and this patch from @rjf.
>>
>> > Furthermore, this may not play well with optimization passes such
as
>> MachineOutliner.
>> Can you share an example where HCS and Machine Outliner don't play
well
>> together?
>>
>
> One thing I can think of is that Machine Outliner is based on code
> pattern, while HCS also looks at hotness. The inconsistency can lead to
> missing opportunities in machine outliner ?
>
>>
>> > Synergistic optimizations are harder to reason about due to the
pass
>> timing. For example, inlining can be more aggressive if any cold code
introduced
>> is trimmed.
>> How does this regress workloads if we have profile information and cold
>> portions of a callee is outlined? Does the inliner always regress
workloads
>> we are evaluating?
>>
>>
> Currently inliner only looks at static code size (as cost). We are working
> on improving it.
>
>
>
>> PS: One correction I'd like to make is HCS splits SEME regions
(Thanks to
>> Vedant). In case some SEME aren't getting outlined, there are CFG
>> transformations to make them friendly to HCS. I'd love to see such
an
>> example, as that'd motivate some of the future work.
>>
> Thanks for pointing this out.
>
>>
> It is common to see multiple entry cold regions post inlining.  After
> block layout, we may see long chains of cold blocks with many blocks in the
> chain being jump targets (from hot regions).   If there are multiple blocks
> exiting the cold region, we get multiple exits case.
>
> thanks,
>
> David
>
>
>
>
>
>> Aditya Kumar
>> Compiler Engineer
>> https://bitsimplify.com
>>
>>
>> On Tue, Aug 4, 2020 at 5:31 PM Snehasish Kumar <snehasishk at
google.com>
>> wrote:
>>
>>> Greetings,
>>>
>>> We present “Machine Function Splitter”, a codegen optimization pass
>>> which splits functions into hot and cold parts. This pass leverages
the
>>> basic block sections feature recently introduced in LLVM from the
Propeller
>>> project. The pass targets functions with profile coverage,
identifies cold
>>> blocks and moves them to a separate section. The linker groups all
cold
>>> blocks across functions together, decreasing fragmentation and
improving
>>> icache and itlb utilization. Our experiments show >2%
performance
>>> improvement on clang bootstrap, ~1% improvement on Google workloads
and
>>> 1.6% mean performance improvement on SPEC IntRate 2017.
>>> Motivation
>>>
>>> Recent work at Google has shown that aggressive, profile-driven
inlining
>>> for performance has led to significant code bloat and icache
fragmentation (AsmDB
>>> - Ayers et al ‘2019
<https://research.google/pubs/pub48320/>). We find
>>> that most functions 5 KiB or larger have inlined children more than
10
>>> layers deep bringing in exponentially more code at each inline
level, not
>>> all of which is necessarily hot. Generally, in roughly half of even
the
>>> hottest functions, more than 50% of the code bytes are never
executed, but
>>> likely to be in the cache.
>>>
>>> Function splitting is a well known compiler transformation
primarily
>>> targeting improved code locality to improve performance. LLVM has a
>>> middle-end, target agnostic hot cold splitting pass
>>>
<https://llvm.org/devmtg/2019-10/slides/Kumar-HotColdSplitting.pdf> as
>>> well as a partial inlining pass
>>>
<https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/IPO/PartialInlining.cpp>
>>> which performs similar transformations, as noted by the authors in
a
>>> recent email thread
>>>
<https://lists.llvm.org/pipermail/llvm-dev/2020-June/142429.html>.
>>> However, due to the timing of the respective passes as well as the
code
>>> extraction techniques employed, the overall gains on large, complex
>>> applications leave headroom for improvement. By deferring function
>>> splitting to the codegen phase we can maximize the opportunity to
remove
>>> cold code as well as refine the code extraction technique.
Furthermore, by
>>> performing function splitting very late, earlier passes can perform
more
>>> aggressive optimizations.
>>> Implementation
>>>
>>> We propose a new machine function splitting pass which leverages
the basic
>>> block sections feature <https://reviews.llvm.org/D68063> to
split
>>> functions without the caveats of code extraction in the middle-end.
The
>>> pass uses profile information to identify cold basic blocks very
late in
>>> LLVM CodeGen, after regalloc and all other machine passes have
executed.
>>> This allows our implementation to be precise in its assessment of
cold
>>> regions while maximizing opportunity.
>>>
>>> Each function is split into two parts. The hot cluster includes the
>>> function entry and all blocks which are not cold. All the cold
blocks are
>>> grouped together as a Cold Section cluster
>>>
<https://github.com/llvm/llvm-project/blob/5934df0c9abe94fc450fbcf0ceca21cf838840e9/llvm/include/llvm/CodeGen/MachineBasicBlock.h#L63>.
>>> With basic block sections, the cold blocks are assigned appropriate
debug
>>> and call frame information and emitted as part of the
.text.unlikely
>>> section. Unlike Propeller
>>>
<https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html>,
>>> which is presently the main user of the basic block sections
feature, this
>>> pass does not require an additional round of profiling and uses
existing
>>> instrumentation based FDO or CSFDO profile information.
>>>
>>> [image: Machine Function Splitter.png]
>>>
>>>
>>> In the illustration above, the functions foo and bar contain a cold
>>> block each, index 5 and E respectively. We show a possible layout
for these
>>> functions which optimizes for fall throughs. Note that all the
blocks are
>>> kept in a contiguous region described by the symbols foo and bar.
Using the
>>> machine function splitter, the cold blocks (5 and E) are moved to a
>>> separate section. These blocks can then be grouped along with other
cold
>>> blocks (and functions) in a separate output section in the final
binary.
>>> The key highlights of this approach are:
>>>
>>>    -
>>>
>>>    Profile driven, profile type agnostic approach.
>>>    -
>>>
>>>    Cold basic blocks are split out using jumps.
>>>    -
>>>
>>>    No additional instructions are added to the function for
>>>    setup/teardown.
>>>    -
>>>
>>>    Runs as the last step before emitting assembly, no
>>>    analysis/optimizations are hindered.
>>>
>>>
>>> Exceptions
>>>
>>> All eh pads are grouped together regardless of their coldness and
are
>>> part of the original function. There are outstanding issues with
splitting
>>> eh pads if they reside in separate sections in the binary. This
remains as
>>> part of future work.
>>>
>>> DebugInfo and CFI
>>>
>>> Debug information and CFI directives are updated and kept
consistent by
>>> the underlying basic block sections framework. Support added in the
>>> following patches
>>>
>>>    -
>>>
>>>    DebugInfo (https://reviews.llvm.org/D78851)
>>>    -
>>>
>>>    CFI (https://reviews.llvm.org/D79978).
>>>
>>>
>>>
>>> Distinction between Machine Function Splitter and Propeller
>>>
>>>
>>> Full Propeller optimizations include function splitting and layout
>>> optimizations, however it requires an additional round of profiling
using
>>> perf on top of the peak (FDO/CSFDO + ThinLTO) binary. In this work
we
>>> experiment with applying function splitting using the instrumented
profile
>>> in the build instead of adding an additional round of profiling.
>>>
>>> Link to Propeller RFC
>>>
<https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html>
>>>
>>>
>>> Split Binary Characteristics
>>>
>>> Binaries produced by the compiler with function splitting enabled
>>> contain additional symbols. A function which has been split into a
hot and
>>> cold part is non-contiguous. The symbol table entry for the hot
part
>>> retains the symbol name of the original function with type FUNC.
The symbol
>>> for the cold part contains a “.cold” suffix attached to the
original symbol
>>> name, the type is not set for this symbol. Using a suffix has been
the norm
>>> for such optimizations e.g. -hot-cold-split in LLVM and the prior
GCC
>>> implementation detailed earlier. We expect standardized tooling to
handle
>>> split functions appropriately, e.g demangling works as expected --
>>>
>>> $ c++filt _Z3foov.cold
>>>
>>> foo() [clone .cold]
>>>
>>> Contrast with HotColdSplit (HCS)
>>>
>>> Function splitting in the middle-end in LLVM employs extraction of
cold
>>> single-entry-single-exit (SESE) regions into separate functions. In
>>> general, the pass has been found to be impactful in reducing code
size by
>>> deduplication of cold regions; however our experiments show it does
not
>>> improve performance of large workloads.
>>>
>>> The key differences are:
>>>
>>> Extraction methodology and tradeoffs
>>>
>>> HCS extracts cold code from SESE regions using a function call.
This may
>>> incur a spill and fill of caller registers along with additional
setup and
>>> teardown if live values modified in the cold region need to be
communicated
>>> back to the original function. This has a couple of implications
>>>
>>>    1.
>>>
>>>    The “residue” of each extracted region is non-trivial and there
is a
>>>    tradeoff between the amount of code that needs to be cold before
it is
>>>    profitable to extract. Thus the cost of mischaracterization is
high.
>>>    2.
>>>
>>>    Since each SESE region is extracted separately the net reduction
in
>>>    code size of the original function is less.
>>>
>>>
>>> In contrast, the machine function splitter extracts cold code into
a
>>> separate section. Control is transferred to cold code via jumps.
More often
>>> than not these jumps may already exist as part of the original
layout thus
>>> incurring no additional cost. No additional instructions are
inserted to
>>> accommodate splitting. Finally, no additional setup/teardown is
necessary
>>> for live values modified in cold regions.
>>>
>>> Pass timing and interaction with other optimizations
>>>
>>> The HCS pass is run on the IR in the optimizer. This allows it to
be
>>> target agnostic and allow later stages to merge identical code if
>>> necessary. However, there are some drawbacks to this approach. In
>>> particular,
>>>
>>>    1.
>>>
>>>    Splitting early may miss opportunities introduced by later
passes
>>>    such as library call inlining and CFG simplification resulting
from a
>>>    combination of optimizations. Furthermore, this may not play
well with
>>>    optimization passes such as MachineOutliner.
>>>    2.
>>>
>>>    Synergistic optimizations are harder to reason about due to the
pass
>>>    timing. For example, inlining can be more aggressive if any cold
code
>>>    introduced is trimmed.
>>>
>>>
>>> In contrast, the machine function splitter runs as the last step in
>>> codegen. This ensures that the opportunity for splitting is
maximised
>>> without hindering existing analyses and synergistic decisions can
be made
>>> in earlier optimization passes. We rely on accurate profile count
>>> propagation across optimizations to maximise opportunities. This
works
>>> particularly well for instrumented profiles while improving the
pass for
>>> sampled profiles is ongoing work.
>>>
>>> We have provided a contrived example in the Appendix which
demonstrates
>>> the code generated for both approaches. The key differences are
highlighted
>>> inline.
>>>
>>> Evaluation
>>>
>>> In this section, we present an in-depth evaluation of the impact on
>>> clang bootstrap and summary results for two google internal
workloads,
>>> Search1 and Search2 as well overall results on the SPECInt 2017
benchmarks.
>>> All experiments are conducted on Intel Skylake based systems unless
>>> otherwise noted. Profile guided optimizations using instrumented
profiles
>>> are enabled for all builds.
>>>
>>> clang-bootstrap
>>>
>>> We pick 500 compiler invocations from a bootstrap build of clang
and
>>> then evaluate the performance of a PGO+ThinLTO optimized version
with that
>>> of PGO+ThinLTO+Split compiler. For the latter, the final optimized
build
>>> includes the machine function splitter.
>>>
>>> Results:
>>>
>>> We observe a mean 2.33% improvement in end to end runtime. The
>>> improvements in runtime are driven by reduction in icache and TLB
miss
>>> rates. The table below summarizes our experiment, each data point
is
>>> averaged over multiple iterations. The observed variation for each
metric
>>> is < 1%.
>>>
>>> Event
>>>
>>> Split (MPKI)
>>>
>>> Baseline (MPKI)
>>>
>>> % Reduction
>>>
>>> itlb_miss
>>>
>>> 0.87
>>>
>>> 1.28
>>>
>>> 31.70
>>>
>>> stlb_miss
>>>
>>> 0.08
>>>
>>> 0.12
>>>
>>> 32.51
>>>
>>> l1i_miss
>>>
>>> 5.98
>>>
>>> 6.61
>>>
>>> 9.56
>>>
>>> l2_miss
>>>
>>> 0.27
>>>
>>> 0.34
>>>
>>> 20.02
>>>
>>> In this experiment, the function splitting pass moved cold code
from
>>> ~30K functions in .text and .text.hot. We present a comparison of
the
>>> binary contents using bloaty
<https://github.com/google/bloaty>
>>>
>>>
>>>     FILE SIZE        VM SIZE
>>>
>>>  --------------  --------------
>>>
>>>    +23% +8.26Mi   +23% +8.26Mi    .text.unlikely
>>>
>>>   +6.5%  +761Ki  [ = ]       0    .strtab
>>>
>>>   +4.8%  +247Ki  +4.8%  +247Ki    .eh_frame
>>>
>>>   +6.1%  +193Ki  [ = ]       0    .symtab
>>>
>>>   +8.5% +63.1Ki  +8.5% +63.1Ki    .eh_frame_hdr
>>>
>>>   +0.3% +31.3Ki  +0.3% +31.3Ki    .rodata
>>>
>>>   +0.4%      +3  [ = ]       0    [Unmapped]
>>>
>>>   -0.3%      -8  -0.3%      -8    .init_array
>>>
>>>   [ = ]       0 -33.3%      -8    [LOAD #4 [RW]]
>>>
>>>   [ = ]       0  -0.2%    -416    .bss
>>>
>>>  -57.1% -4.04Mi -57.1% -4.04Mi    .text.hot
>>>
>>>  -48.4% -4.13Mi -48.4% -4.13Mi    .text
>>>
>>>   +1.6% +1.35Mi  +0.6%  +430Ki    TOTAL
>>>
>>> We see that 48% and 57% of code in .text and .text.hot respectively
was
>>> moved to the .text.unlikely section. We also note a small increase
in
>>> overall binary size due to the following reasons:
>>>
>>>    -
>>>
>>>    Some additional jump instructions may be inserted.
>>>    -
>>>
>>>    Small increase in associated metadata, e.g. debug information.
>>>    -
>>>
>>>    Additional symbols of type foo.cold for cold parts.
>>>    -
>>>
>>>    Alignment requirements for both original and split function
parts.
>>>
>>>
>>> Comparison with HotColdSplit
>>>
>>> For the clang-bootstrap benchmark we also compared the performance
of
>>> the hot-cold-split pass with split-machine-functions. We summarize
the
>>> results for performance and the characteristics of the binary built
by each
>>> pass in the table below. Each metric is presented as change vs the
>>> baseline, an FDO optimized build of clang.
>>>
>>>
>>> Hot Cold Split
>>>
>>> Machine Function Splitter
>>>
>>> Performance
>>>
>>> 1.10%
>>>
>>> 2.65%
>>>
>>> .text size
>>>
>>> -41.5% -2.89Mi
>>>
>>> -49.2% -3.43Mi
>>>
>>> .text.hot size
>>>
>>> -46.9% -2.52Mi
>>>
>>> -57.1% -3.07Mi
>>>
>>> Full binary size
>>>
>>> 9.6% +7.56Mi
>>>
>>> 1.7% +1.37Mi
>>>
>>> Note that the increase in overall binary size increase for HCS is
due to
>>> the increase in .eh_frame (+61% +3.03Mi). HCS extracts each cold
SESE
>>> region as a separate function whereas the machine function splitter
>>> extracts the cold code as a single region thus incurring a constant
>>> overhead per function.
>>>
>>> Google workloads
>>>
>>> We evaluated the impact of function splitting on a couple of search
>>> workloads, Search1 and Search2. A key difference with respect to
the clang
>>> experiment above is the use of huge pages for code. Overall, we
find that
>>> on Intel Skylake the key benefit is from reduction of iTLB misses
whereas
>>> on AMD the key benefit is from the reduction of icache misses. This
is due
>>> to the fewer iTLB entries available for hugepages on Intel
architectures.
>>> We find that overall throughput for Search1 and Search2 improve
between
>>> 0.8% to 1.2%; a significant improvement on these benchmarks. The
workloads
>>> are built with FDO and CSFDO respectively. On Intel Skylake, iTLB
misses
>>> reduce by 16% to 35%, sTLB misses reduce by 62% to 67%. On AMD, L1
icache
>>> misses improve by 1.2% to 2.6% whereas L2 instruction misses
improve by
>>> 4.8% to 5.1%.
>>>
>>> Comparison with HotColdSplit
>>>
>>> An evaluation of the hot-cold-split pass did not yield performance
>>> improvements on google workloads.
>>>
>>> SPECInt 2017
>>>
>>> We evaluated the impact of the machine function splitter on SPECInt
2017
>>> using the int rate metrics. Overall, we found a 1.6% geomean
intrate
>>> improvement for the benchmarks where performance improved
(500.perlbench_r,
>>> 502.gcc_r, 505.mcf_r, 520.omnetpp_r). For the benchmarks that
didn’t
>>> improve performance, the average degradation was 0.6%
(523.xalancbmk_r,
>>> 525.x264_r, 531.deepsjeng_r, 541.leela_r).
>>>
>>> We note that the instruction footprint of SPEC workloads are
smaller
>>> than most modern workloads and our work is primarily focused on
reducing
>>> the footprint to improve performance. These experiments were
performed on
>>> Intel Haswell machines.
>>>
>>> Appendix
>>>
>>> Example to illustrate hot-cold-split and split-machine-functions
>>>
>>> Input IR
>>>
>>> ```
>>>
>>> @i = external global i32, align 4
>>>
>>> define i32 @foo(i32 %0, i32 %1) nounwind !prof !1 {
>>>
>>>   %3 = icmp eq i32 %0, 0
>>>
>>>   br i1 %3, label %6, label %4, !prof !2
>>>
>>> 4:                                                ; preds = %2
>>>
>>>   %5 =  call i32 @L1()
>>>
>>>   br label %9
>>>
>>> 6:                                                ; preds = %2
>>>
>>>   %7 = call i32 @R1()
>>>
>>>   %8 = add nsw i32 %1, 1
>>>
>>>   br label %9
>>>
>>> 9:                                               ; preds = %6, %4
>>>
>>>   %10 = phi i32 [ %1, %4 ], [ %8, %6 ]
>>>
>>>   %11 = load i32, i32* @i, align 4
>>>
>>>   %12 = add nsw i32 %10, %11
>>>
>>>   store i32 %12, i32* @i, align 4
>>>
>>>   ret i32 %12
>>>
>>> }
>>>
>>> declare i32 @L1()
>>>
>>> declare i32 @R1() cold nounwind
>>>
>>> !1 = !{!"function_entry_count", i64 7}
>>>
>>> !2 = !{!"branch_weights", i32 0, i32 7}
>>>
>>> ```
>>>
>>> Code generated by Machine Function Splitter
>>>
>>> $ llc < example.ll -mtriple=x86_64-unknown-linux-gnu
>>> -split-machine-functions
>>>
>>> ```
>>>
>>>         .text
>>>
>>>         .file   "<stdin>"
>>>
>>>         .globl  foo                             # -- Begin function
foo
>>>
>>>         .p2align        4, 0x90
>>>
>>>         .type   foo, at function
>>>
>>> foo:                                    # @foo
>>>
>>> # %bb.0:
>>>
>>>         pushq   %rbx
>>>
>>>         movl    %esi, %ebx
>>>
>>>         testl   %edi, %edi
>>>
>>>         je      foo.cold                # Jump to cold code
>>>
>>> # %bb.1:
>>>
>>>         callq   L1
>>>
>>> .LBB0_2:
>>>
>>>         addl    i(%rip), %ebx
>>>
>>>         movl    %ebx, i(%rip)
>>>
>>>         movl    %ebx, %eax
>>>
>>>         popq    %rbx
>>>
>>>         retq
>>>
>>>         .section        .text.unlikely.foo,"ax", at
progbits
>>>
>>> foo.cold:
>>>
>>>         callq   R1
>>>
>>>         incl    %ebx                    # Directly increment value
>>>
>>>         jmp     .LBB0_2
>>>
>>> .LBB_END0_3:
>>>
>>>         .size   foo.cold, .LBB_END0_3-foo.cold
>>>
>>>         .text
>>>
>>> .Lfunc_end0:
>>>
>>>         .size   foo, .Lfunc_end0-foo
>>>
>>>                                         # -- End function
>>>
>>>         .section        ".note.GNU-stack","",
at progbits
>>>
>>> ```
>>>
>>> Code generated by Hot Cold Split
>>>
>>> $ clang -c -O2 -S -mllvm --hot-cold-split -mllvm
>>> --hotcoldsplit-threshold=0 -x ir example.ll
>>>
>>> ```
>>>
>>>         .text
>>>
>>>         .file   "example.ll"
>>>
>>>         .globl  foo                             # -- Begin function
foo
>>>
>>>         .p2align        4, 0x90
>>>
>>>         .type   foo, at function
>>>
>>> foo:                                    # @foo
>>>
>>> # %bb.0:
>>>
>>>         pushq   %rbx
>>>
>>>         subq    $16, %rsp
>>>
>>>         movl    %esi, %ebx
>>>
>>>         testl   %edi, %edi
>>>
>>>         jne     .LBB0_1
>>>
>>> # %bb.2:                                # Residue block in original
>>> function
>>>
>>>         leaq    12(%rsp), %rsi
>>>
>>>         movl    %ebx, %edi              # Pass param to increment
>>>
>>>         callq   foo.cold.1              # Call to cold code
>>>
>>>         movl    12(%rsp), %ebx          # Fill incremented value
from
>>> stack
>>>
>>> .LBB0_3:
>>>
>>>         addl    i(%rip), %ebx
>>>
>>>         movl    %ebx, i(%rip)
>>>
>>>         movl    %ebx, %eax
>>>
>>>         addq    $16, %rsp
>>>
>>>         popq    %rbx
>>>
>>>         retq
>>>
>>> .LBB0_1:
>>>
>>>         callq   L1
>>>
>>>         jmp     .LBB0_3
>>>
>>> .Lfunc_end0:
>>>
>>>         .size   foo, .Lfunc_end0-foo
>>>
>>>                                         # -- End function
>>>
>>>         .p2align        4, 0x90                         # -- Begin
>>> function foo.cold.1
>>>
>>>         .type   foo.cold.1, at function
>>>
>>> foo.cold.1:                             # @foo.cold.1
>>>
>>> # %bb.0:                                # %newFuncRoot
>>>
>>>         pushq   %rbp
>>>
>>>         pushq   %rbx
>>>
>>>         pushq   %rax
>>>
>>>         movq    %rsi, %rbx
>>>
>>>         movl    %edi, %ebp
>>>
>>>         callq   R1
>>>
>>>         incl    %ebp
>>>
>>>         movl    %ebp, (%rbx)
>>>
>>>         addq    $8, %rsp
>>>
>>>         popq    %rbx
>>>
>>>         popq    %rbp
>>>
>>>         retq
>>>
>>> .Lfunc_end1:
>>>
>>>         .size   foo.cold.1, .Lfunc_end1-foo.cold.1
>>>
>>>                                         # -- End function
>>>
>>>         .cg_profile foo, L1, 0
>>>
>>>         .cg_profile foo, foo.cold.1, 7
>>>
>>>         .section        ".note.GNU-stack","",
at progbits
>>>
>>>         .addrsig
>>>
>>>         .addrsig_sym foo.cold.1
>>>
>>> ```
>>>
>>> Thanks,
>>> Snehasish Kumar
>>> Software Engineer, Google
>>>
>>>
>>>
>>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200805/797ae66a/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Machine Function Splitter.png
Type: image/png
Size: 49967 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200805/797ae66a/attachment-0001.png>

Jessica Paquette via llvm-dev

2020-Aug-11 16:14 UTC

head link

[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

<html><head></head><body dir="auto"
style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break:
after-white-space;"><meta http-equiv="Content-Type"
content="text/html; charset=us-ascii"><div
style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break:
after-white-space;" class=""><div><blockquote
type="cite" class=""><div
class=""><div dir="ltr" class=""><div
class="gmail_quote"><div class=""><br
class=""></div><div class="">
</div><blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"><div dir="ltr"
class=""><div dir="ltr" class=""><div
dir="ltr" class=""><div
class=""></div><div class="">> <span
style="white-space: pre-wrap;" class="">In contrast, the
machine function splitter extracts cold code into a </span><span
style="white-space: pre-wrap;" class="">separate section.
</span></div><div class="">HCS also adds a section
prefix to all the cold functions. It is possible that the cold functions are
still in the same section as the hot one depending on the linker. Ruijie has a
patch to move all the cold functions to a separate section, we are still
evaluating the results (<a
href="https://github.com/ruijiefang/llvm-hcs/commit/4966997e135050c99b4adc0aa971242af99ba832"
target="_blank"
class="">https://github.com/ruijiefang/llvm-hcs/commit/4966997e135050c99b4adc0aa971242af99ba832</a>).
In case it is not difficult to rerun the experiments, it'll help to see the
numbers with the llvm-trunk and this patch from @rjf.</div><div
class=""><br class=""></div><div
class="">> <span style="white-space: pre-wrap;"
class="">Furthermore, this may not play well
with</span><span style="white-space: pre-wrap;"
class=""> optimization passes such as
MachineOutliner.</span></div><div class=""><span
style="white-space: pre-wrap;" class="">Can you share an
example where HCS and Machine Outliner don't play well
together?</span></div></div></div></div></blockquote><div
class=""><br class=""></div><div
class="gmail_default" style="font-family: monospace; font-size:
small;">One thing I can think of is that Machine Outliner is based on
code pattern, while HCS also looks at hotness. The inconsistency can lead to
missing opportunities in machine outliner ?</div><blockquote
class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div
dir="ltr" class=""><div dir="ltr"
class=""><div dir="ltr" class=""><div
class=""><span style="white-space: pre-wrap;"
class=""><br
class=""></span></div></div></div></div></blockquote></div></div></div></blockquote></div><div
class=""><br class=""></div><div
class="">Just chiming in about the outliner stuff. (In general, I
think it's desirable to have multiple options for how early/late a pass
runs.)</div><div class=""><br
class=""></div><div class="">From what I
understand, HCS actually tries to leverage the MachineOutliner (and other size
optimizations) by marking cold regions as minsize. (Commit:
03aaa3e2aa37b311999c6af567871325c2fa049f) This is the case at least in AArch64,
where the MachineOutliner is a default minsize optimization.</div><div
class=""><br class=""></div><div
class="">The only missed opportunity I can think of here is
something like the following pseudocode:</div><div
class=""><br class=""></div><div
class=""><br class=""></div><div
class="">hot_func_1:</div><div class=""> 
..</div><div class="">  I1</div><div
class="">  I2</div><div class=""> 
...</div><div class="">  IN</div><div
class="">  ret</div><div class=""><br
class=""></div><div class=""><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class="">hot_func_2:</div><div style="caret-color:
rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""> 
..</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0,
0);" class="">  I1</div><div style="caret-color:
rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""> 
I2</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0,
0);" class="">  ...</div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class="">  IN</div><div style="caret-color: rgb(0, 0,
0); color: rgb(0, 0, 0);" class=""> 
ret</div></div><div style="caret-color: rgb(0, 0, 0); color:
rgb(0, 0, 0);" class=""><br
class=""></div><div class=""><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class="">cold_func:</div><div style="caret-color:
rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""> 
..</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0,
0);" class="">  I1</div><div style="caret-color:
rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""> 
I2</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0,
0);" class="">  ...</div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class="">  IN</div><div style="caret-color: rgb(0, 0,
0); color: rgb(0, 0, 0);" class="">  ret</div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class=""><br class=""></div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class=""><br class=""></div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class="">Let's say that on this target, we can outline tail
calls, and that tail calls don't incur any execution time cost. In this
case, we probably <i class="">do </i>want the following
pseudo-assembly after hot-cold splitting:</div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class=""><br class=""></div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class=""><br class=""></div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class="">hot_func_1:</div><div style="caret-color:
rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""> 
...</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0,
0);" class="">  tail_call
OUTLINED_FUNCTION_N</div><div style="caret-color: rgb(0, 0, 0);
color: rgb(0, 0, 0);" class=""><br
class=""></div><div style="caret-color: rgb(0, 0, 0);
color: rgb(0, 0, 0);" class=""><div
class="">hot_func_2:</div><div class=""> 
...</div><div class="">  tail_call
OUTLINED_FUNCTION_N</div></div><div style="caret-color:
rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""><br
class=""></div><div style="caret-color: rgb(0, 0, 0);
color: rgb(0, 0, 0);" class=""><div
class="">cold_func:</div><div class=""> 
...</div><div class="">  tail_call
OUTLINED_FUNCTION_N</div></div><div style="caret-color:
rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""><br
class=""></div><div style="caret-color: rgb(0, 0, 0);
color: rgb(0, 0, 0);" class=""><br
class=""></div><div style="caret-color: rgb(0, 0, 0);
color: rgb(0, 0, 0);" class="">Since HCS will only give
cold_func the minsize attribute, this can't happen. However, this isn't
really a matter of where you run HCS. This is more that the MachineOutliner
doesn't have a way to recognize that certain outlining styles are cheaper
than others.</div><div style="caret-color: rgb(0, 0, 0); color:
rgb(0, 0, 0);" class=""><br
class=""></div><div style="caret-color: rgb(0, 0, 0);
color: rgb(0, 0, 0);" class="">Also +Vedant who's thought
about how HCS + the MachineOutliner interact.</div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class=""><br class=""></div><div
style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);"
class="">-
Jessica</div></div></div></body></html>

aditya kumar via llvm-dev

2020-Aug-12 17:24 UTC

head link

[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

> Just chiming in about the outliner stuff. (In general, I think it'sdesirable to have multiple options for how early/late a pass runs.)

I'm wondering if MachineOutliner can be augmented to add
MachineFunctionSplitter functionalities as well. If the analysis part of
MachineOutliner can allow single basic block outlining with some cost
models.

Aditya Kumar
Compiler Engineer
https://bitsimplify.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200812/c9322eec/attachment.html>

Maybe Matching Threads

Search for more maybe matching threads

llvm dev - Aug 2020 - [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

[llvm-dev] [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

Maybe Matching Threads