thr3ads.net - llvm dev - [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations [Sep 2019]

If this information is useful, please help other people find it:
Share via:

Xinliang David Li via llvm-dev

2019-Sep-26 00:58 UTC

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> My biggest question about this architecture is about when propeller runs
> basic block reordering within a function.  It seems like a lot of the
> complexity comes from using the proposed -fbasicblock-sections to generated
> mangled ELF, and then re-parsing the mangled ELF as a separate step. 
I'm
> not sure that's the right approach, long-term.
>
> Splitting every basic block into its own section introduces overhead, like
> you described.  And it's likely more complex on non-x86 targets, which
have
> a greater variety of conditional branches.  And the reordering itself
> involves a bunch of x86 and ELF-specific code.
>
> I'd like to suggest an alternative: instead of perform basic block
> reordering and function splitting by manipulating the ELF files, you could
> perform reordering and splitting as part of link-time code generation, as
> an MIR pass that runs just before the assembly printer.  MIR is almost
> exactly the form you want for this sort of manipulation: it has basic
> blocks which correspond closely to the final binary, and a high-level
> representation of branch instructions.

This was considered for Propeller.  This  is currently being explored in a
similar way as an alternative of CSFDO which uses PMU samples.

> And it's before the DWARF/CFI emission, so you don't need to worry
about
> fixing them afterwards.  This should take less code overall, and much less
> target-specific code. And infrastructure for function splitting would be
> useful for non-Propeller workflows.
>
> There are some minor downsides to this approach I can think of.  You lose
> a little flexibility, in that you can't mix blocks from different
functions
> together, but you aren't doing that anyway, from your description?

One of the main design objectives of Propeller is to have the capability to
do interprocedural code transformations (reordering, cloning, dedupping
etc), so this won't be a minor downside. Function/block alignment (for
branch misprediction reduction etc) will also need to be done as a global
optimization in the future.

David


> It might take more CPU time to re-do LTO code generation, but that's
> probably not a big deal for applications where the user is willing to
> collect two different kinds PGO of profiles for a program.
>
> I understand you can't use a regular PGO profile for this, but it
should
> be possible to pass in a Propeller-PGO profile separately to the compiler,
> and apply that to the MIR as a late pass in the compiler, separate from any
> regular PGO profile.  There are probably multiple ways you could do this;
> you could use basic block symbols, like your current implementation.
>
> -Eli
>
> -----Original Message-----
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of
Sriraman
> Tallam via llvm-dev
> Sent: Tuesday, September 24, 2019 4:51 PM
> To: llvm-dev at lists.llvm.org
> Subject: [EXT] [llvm-dev] [RFC] Propeller: A frame work for Post Link
> Optimizations
>
> Greetings,
>
> We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer
> framework, on large google benchmarks and noticed that it improves key
> performance metrics of these benchmarks by 2% to 6%, which is pretty
> impressive
> as this is over and above a baseline binaryalready heavily optimized with
> ThinLTO + PGO.  Furthermore, BOLT is also able to improve the performance
> of
> binaries optimized via Context-Sensitive PGO.     While ThinLTO + PGO is
> also
> profile guided and does very aggressive performance optimizations, there is
> more room for performance improvements due to profile approximations while
> applying the transformations.  BOLT uses exact profiles from the final
> binary
> and is able to fill the gaps left by ThinLTO + PGO. The performance
> improvements due to BOLT come from basic block layout, function reordering
> and
> function splitting.
>
> While BOLT does an excellent job of squeezing extra performance from highly
> optimized binaries with optimizations such as code layout, it has these
> major
> issues:
>
>  * It does not take advantage of distributed build systems.
>  * It has scalability issues and to rewrite a binary with a ~300M text
> segment
> size:
>  * Memory foot-print is 70G.
>  * It takes more than 10 minutes to rewrite the binary.
>
> Similar to Full LTO, BOLT’s design is monolithic as it disassembles the
> original binary, optimizes and rewrites the final binary in one process.
> This
> limits the scalability of BOLT and the memory and time overhead shoots up
> quickly for large binaries.
>
> Inspired by the performance gains and to address the scalability issue of
> BOLT,
> we went about designing a scalable infrastructure that can perform
> BOLT-like
> post-link optimizations. In this RFC, we discuss our system, “Propeller”,
> which can perform profile guided link time binary optimizations in a
> scalable
> way and is friendly to distributed build systems.  Our system leverages the
> existing capabilities of the compiler tool-chain and is not a stand alone
> tool.
> Like BOLT, our system boosts the performance of optimized binaries via
> link-time optimizations using accurate profiles of the binary. We discuss
> the
> Propeller system and show how to do the whole program basic block layout
> using
> Propeller.
>
> Propeller does whole program basic block layout at link time via basic
> block
> sections.  We have added support for having each basic block in its own
> section
> which allows the linker to do arbitrary reorderings of basic blocks to
> achieve
> any desired fine-grain code layout which includes block layout, function
> splitting and function reordering.  Our experiments on large real-world
> applications and SPEC with code layout show that Propeller can optimize as
> effectively as BOLT, with just 20% of its memory footprint and time
> overhead.
>
> An LLVM branch with propeller patches is available in the git repository
> here:
> https://github.com/google/llvm-propeller/  We will upload individual
> patches of
> the various elements for review.  We have attached a google doc describing
> the
> Propeller system with Experiments in detail,
> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>
>
> Quick Start - How to optimize further with Propeller?
>
> # git clone and build repo
>
> $ cd $LLVM_DIR && git clone
https://github.com/google/llvm-propeller.git
>
> $ mkdir $BUILD_DIR && cd $BUILD_DIR
>
> $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" …
\
>    $LLVM_DIR/llvm-propeller/llvm
>
> $ ninja -j25
>
> $ export PATH=$BUILD_DIR/bin:$PATH
>
>
> # Let’s Propeller-optimize the following program:
>
>
> # Step 1: Build the peak optimized binary with an additional flag.
>
> $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels
> -fuse-ld=lld
>
> # Step 2: Profile the binary, only one side of the branch is executed.
> $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >&
> /dev/null
>
>
> # Step 3: Convert the profiles using the tool provided
> $ $LLVM_DIR/llvm-propeller/create_llvm_prof  --format=propeller \
>   --binary=./a.out.labels --profile=perf.data  --out=perf.propeller
>
>
> # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag
> changed.
> $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller
> -fuse-ld=lld
>
> In Step 4, the optimized bit code can be used if it is saved in Step1 as
> Propeller is active only during compile backend and link.  The optimized
> binary
> has a different layout of the basic blocks in main to keep the executed
> blocks
> together and split the cold blocks.
>
> Some of the key points:
>
> +  Added support for basic block sections, similar to function sections,
> where
> each basic block can reside in its own section.
>
> +  Jump instructions need 32-bit relocations and subsequent linker
> relaxations
> after basic block layout.  We would like to add a new relocation type for
> jump
> instructions to make it easier for relaxations and guarantee correctness.
>
> +  Added support in the linker to read profiles (PMU LBR) and discover
> optimal
> basic block layout using the Ex-TSP algorithm described here:
> https://arxiv.org/abs/1809.04676
>
> +  Added support in the linker to re-order basic block sections in any
> specified sequence (can be done with symbol ordering file).  This requires
> linker relaxations to delete and shrink branches across basic blocks.
>
> +  Compared our system to BOLT  and have shown that our system can produce
> similar performance improvements as BOLT but with much less memory and time
> overheads.  Our experiments are on very large warehouse-scale benchmarks
> and
> SPEC 2017.
>
> +  Listed why this cannot be done as part of PGO itself.  Post Link
> optimizations are able to transform the binary using accurate profiles and
> PGO
> passes suffer from profile imprecision.
>
> +  Updated DebugInfo and CFI to account for arbitrary ordering of basic
> blocks
> via basic block sections.
>
> +  Discussed CFI handling  and is sub-optimal and bloats object file sizes
> and
> binary sizes much more than DebugInfo due to lack of support for
> discontiguous
> address ranges.  We have talked about this and would like to make a case to
> support discontiguous ranges with CFI which will make basic block sections
> much
> more cheaper.
>
> Detailed RFC document here :
> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>
> Please let us know what you think,
> Thanks
> Sri on behalf of the team.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190925/b1661614/attachment.html>

Eli Friedman via llvm-dev

2019-Sep-26 19:38 UTC

head link

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

From: Xinliang David Li <xinliangli at gmail.com>
Sent: Wednesday, September 25, 2019 5:58 PM
To: Eli Friedman <efriedma at quicinc.com>
Cc: Sriraman Tallam <tmsriram at google.com>; llvm-dev <llvm-dev at
lists.llvm.org>
Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
Optimizations

On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
My biggest question about this architecture is about when propeller runs basic
block reordering within a function.  It seems like a lot of the complexity comes
from using the proposed -fbasicblock-sections to generated mangled ELF, and then
re-parsing the mangled ELF as a separate step.  I'm not sure that's the
right approach, long-term.

Splitting every basic block into its own section introduces overhead, like you
described.  And it's likely more complex on non-x86 targets, which have a
greater variety of conditional branches.  And the reordering itself involves a
bunch of x86 and ELF-specific code.

I'd like to suggest an alternative: instead of perform basic block
reordering and function splitting by manipulating the ELF files, you could
perform reordering and splitting as part of link-time code generation, as an MIR
pass that runs just before the assembly printer.  MIR is almost exactly the form
you want for this sort of manipulation: it has basic blocks which correspond
closely to the final binary, and a high-level representation of branch
instructions.

This was considered for Propeller.  This  is currently being explored in a
similar way as an alternative of CSFDO which uses PMU samples.

Makes sense.

And it's before the DWARF/CFI emission, so you don't need to worry about
fixing them afterwards.  This should take less code overall, and much less
target-specific code. And infrastructure for function splitting would be useful
for non-Propeller workflows.

There are some minor downsides to this approach I can think of.  You lose a
little flexibility, in that you can't mix blocks from different functions
together, but you aren't doing that anyway, from your description?

One of the main design objectives of Propeller is to have the capability to do
interprocedural code transformations (reordering, cloning, dedupping etc), so
this won't be a minor downside. Function/block alignment (for branch
misprediction reduction etc) will also need to be done as a global optimization
in the future.

Okay, so my suggestion doesn’t work. I’m still concerned the proposed design is
going to push us in a direction we don’t want to go.  Particularly, if you’re
going to attempt more complicated transforms, the problems caused by the limited
information available in an ELF file will become more prominent.  I mean, yes,
you can come up with a more complicated implicit contract between the compiler
and Propeller about the exact format of Propeller basic blocks, and add
additional symbol annotations, and eventually come up with an “IR” that allows
Propeller to perform arbitrary code transforms.  But that’s inevitably going to
be more complicated, and harder to understand, than a compiler IR designed for
optimizations.

If we are going to stick with ELF-as-compiler-IR, I’d like to see more careful
documentation of the contract between the compiler and the Propeller linker, so
it’s clear what the compiler is/is not allowed to emit at compile-time, and so
other compilers could be made compatible in the future.  The current Propeller
implementation is doing some code rewrites that can’t be justified by just
standard ELF section reordering.

-Eli
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190926/0e5f508e/attachment.html>

Sriraman Tallam via llvm-dev

2019-Sep-26 22:24 UTC

head link

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

On Thu, Sep 26, 2019 at 12:39 PM Eli Friedman <efriedma at quicinc.com>
wrote:>
>
>
> From: Xinliang David Li <xinliangli at gmail.com>
> Sent: Wednesday, September 25, 2019 5:58 PM
> To: Eli Friedman <efriedma at quicinc.com>
> Cc: Sriraman Tallam <tmsriram at google.com>; llvm-dev <llvm-dev
at lists.llvm.org>
> Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
Optimizations
>
>
>
>
>
>
>
> On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
>
> My biggest question about this architecture is about when propeller runs
basic block reordering within a function.  It seems like a lot of the complexity
comes from using the proposed -fbasicblock-sections to generated mangled ELF,
and then re-parsing the mangled ELF as a separate step.  I'm not sure
that's the right approach, long-term.
>
> Splitting every basic block into its own section introduces overhead, like
you described.  And it's likely more complex on non-x86 targets, which have
a greater variety of conditional branches.  And the reordering itself involves a
bunch of x86 and ELF-specific code.
>
> I'd like to suggest an alternative: instead of perform basic block
reordering and function splitting by manipulating the ELF files, you could
perform reordering and splitting as part of link-time code generation, as an MIR
pass that runs just before the assembly printer.  MIR is almost exactly the form
you want for this sort of manipulation: it has basic blocks which correspond
closely to the final binary, and a high-level representation of branch
instructions.
>
>
>
> This was considered for Propeller.  This  is currently being explored in a
similar way as an alternative of CSFDO which uses PMU samples.
>
>
>
> Makes sense.
>
>
>
> And it's before the DWARF/CFI emission, so you don't need to worry
about fixing them afterwards.  This should take less code overall, and much less
target-specific code. And infrastructure for function splitting would be useful
for non-Propeller workflows.
>
> There are some minor downsides to this approach I can think of.  You lose a
little flexibility, in that you can't mix blocks from different functions
together, but you aren't doing that anyway, from your description?
>
>
>
> One of the main design objectives of Propeller is to have the capability to
do interprocedural code transformations (reordering, cloning, dedupping etc), so
this won't be a minor downside. Function/block alignment (for branch
misprediction reduction etc) will also need to be done as a global optimization
in the future.
>
>
>
> Okay, so my suggestion doesn’t work. I’m still concerned the proposed
design is going to push us in a direction we don’t want to go.  Particularly, if
you’re going to attempt more complicated transforms, the problems caused by the
limited information available in an ELF file will become more prominent.  I
mean, yes, you can come up with a more complicated implicit contract between the
compiler and Propeller about the exact format of Propeller basic blocks, and add
additional symbol annotations, and eventually come up with an “IR” that allows
Propeller to perform arbitrary code transforms.  But that’s inevitably going to
be more complicated, and harder to understand, than a compiler IR designed for
optimizations.
Thanks for the feedback, I am not sure I fully understand your
concerns but let me try to make some of the things clearer:

*  Propeller relinks. Specifically, it regenerates ELF object files
from MIR.  Even if MIR were serializable, we would still be starting
before CFI instruction inserter pass and then regenerate the native
ELF objects.
*  All the code transformations we are planning to do for futuristic
optimizations is on the MIR just like you noted in a previous email.
For example, prefetch insertion was one optimization we were looking
at where the input is inserting prefetches at specific points in the
binary which we will translate to basic blocks at MIR and then insert
them there.  So, I don't think we will be limited by ELF files.
* In comparison with BOLT, BOLT disassembles to MCInst to apply some
of these optimizations and we get to start from MIR without the cost
of disassembly.
* I think your concern is coming from looking at the linker
relaxations we perform for X86_64 and wondering if this would scale
for RISC, etc.  We have not looked at RISC but chatting with Rui
(ruiu@) briefly, this looked like it is doable even without using
thunks.

Thanks
Sri
>
>
>
> If we are going to stick with ELF-as-compiler-IR, I’d like to see more
careful documentation of the contract between the compiler and the Propeller
linker, so it’s clear what the compiler is/is not allowed to emit at
compile-time, and so other compilers could be made compatible in the future. 
The current Propeller implementation is doing some code rewrites that can’t be
justified by just standard ELF section reordering.
>
>
>
> -Eli

llvm dev - Sep 2019 - [RFC] Propeller: A frame work for Post Link Optimizations

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations