thr3ads.net - llvm dev - [LLVMdev] Proposal: MCLinker - an LLVM integrated linker [Nov 2011]


Chinyen Chou

2011-Nov-02 08:57 UTC

[LLVMdev] Proposal: MCLinker - an LLVM integrated linker

Thanks for the useful information. We notice that the idea behind LIPO could
also help LLVM LTO once LLVM has FDO/PGO. As for Diablo, we'll study it, and
I expect we'll get some good ideas from it.

In MCLinker, the details of the instructions and data in bitcode are
preserved during linking, so some opportunities to optimize the instructions
in bitcode become straightforward. Instruction relaxation is one such case.
(Since ARM is one of the targets we focus on, I'm going to use ARM to
illustrate the problem.)

When linking bitcode with other object files, stubs are necessary if the
branch target is out of range or if the branch switches between ARM and
Thumb mode. Google's gold linker basically uses two kinds of stubs. One is a
sequence of consecutive branch instructions, and the other is one branch
instruction followed by an instruction (e.g., ldr) that changes PC directly.
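
The "out of range or mode switch" condition can be sketched as a small check. This is only an illustration, not MCLinker or gold code: the helper name `needs_stub` is hypothetical, and it assumes an ARM-state B/BL (signed 24-bit word offset, relative to the instruction address plus 8) that the linker cannot simply rewrite into BLX.

```python
# Assumption: the caller is an ARM-state B/BL instruction.
# Its immediate is a signed 24-bit word offset, applied to PC = insn address + 8.
ARM_PC_BIAS = 8
ARM_BRANCH_MIN = -(1 << 25)       # -32 MiB
ARM_BRANCH_MAX = (1 << 25) - 4    # +32 MiB - 4


def needs_stub(branch_addr, target_addr, target_is_thumb=False):
    """Hypothetical helper: True if a direct ARM branch cannot reach the target."""
    # A mode switch to Thumb cannot be expressed by a plain ARM B/BL.
    if target_is_thumb:
        return True
    # Otherwise the only question is whether the offset fits in the encoding.
    offset = target_addr - (branch_addr + ARM_PC_BIAS)
    return not (ARM_BRANCH_MIN <= offset <= ARM_BRANCH_MAX)
```

A linker that keeps per-instruction detail could run a check like this for every call site and only then decide between a stub and a rewritten branch sequence.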

Example of the latter case:

1: bl    <stub_address>
...
2: ldr   pc, [pc, #-4]   ; stub
3: dcd   R_ARM_ABS32(X)
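
For readers wondering why the stub uses an offset of #-4: in ARM state, reading pc yields the address of the current instruction plus 8, so [pc, #-4] addresses the word immediately after the ldr, which is the dcd literal. A one-line sketch of that arithmetic (the function name is made up for illustration):

```python
def arm_pc_relative_address(insn_addr, offset):
    """Address referenced by [pc, #offset] in ARM state (pc reads as insn_addr + 8)."""
    return insn_addr + 8 + offset


# With the ldr at 0x1000, the literal word is loaded from 0x1004.
literal_addr = arm_pc_relative_address(0x1000, -4)
```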

In MCLinker, we can optimize it as follows:

X: ldr   ip, [pc, #-4]
Y: dcd   R_ARM_ABS32(X)
Z: bx    ip

Before optimization, some processors suffer from flushing the ROB/Q because
their pipelines are filled with the invalid instructions that immediately
follow the ldr. None of these instructions should be executed, so the
processor must flush them when the ldr is committed.

Since all details of instructions and data are preserved, MCLinker can
directly rewrite the program without inserting a stub. It can replace the
1: bl instruction with a longer branch, Z: bx, and the performance of the
program is therefore improved by more efficient use of the branch target
buffer (BTB).
This is just one case; there are other optimizations we can do.

Thanks,
Chinyen

> > In GCC, LTO causes 'fat' object files, because GCC needs to serialize
> > IR into 'intermediate language' (IL) and compress IL in object files.
> > In our experience, the 'fat' object files are x10 bigger than the
> > original one, and slow down the linking process significantly. The
> > generated code can get about only 7%~13% improvement.
>
> Right.  Though GCC 4.7 will offer an option to emit just bytecode in
> object files.  Additionally, the biggest gains we generally observe
> with LTO is when it's coupled with FDO.  And almost always, the
> biggest wins are in the inliner
> (http://gcc.gnu.org/wiki/LightweightIpo).
>
> > Apart from the LTO, we also have some good idea on link time
> > optimization. I will open another thread to discuss this later.
>
> You may want to look at Diablo (http://diablo.elis.ugent.be/).  An
> optimizing linker that has been around for a while.  I'm not sure
> whether it is still being developed, but they had several interesting
> ideas in it.
>
>
> Diego.

Michael Spencer

2011-Nov-03 07:27 UTC

[LLVMdev] Proposal: MCLinker - an LLVM integrated linker

On Wed, Nov 2, 2011 at 11:05 PM, Don Quixote de la Mancha
<quixote at dulcineatech.com> wrote:
> A helpful link-time optimization would be to place subroutines that
> are used close together in time also close together in the executable
> file.  That also goes for data that is in the executable file, whether
> initialized (.data segment) or zero-initialized (.bss).
>
> If the unit of linkage of code is the function rather than the
> compilation module, and the unit of linkage of data is the individual
> data item rather than all the .data and .bss items together that are
> in a compilation unit, you could rearrange them at will.

This is exactly what the atom model provides. And some of the use
cases you describe were actually discussed at the social tonight.

- Michael Spencer

> For architectures such as ARM that cannot make jumps to faraway
> addresses, you could make the destinations of subroutine calls close
> to the caller so you would not need so many trampolines.
>
> The locality improves the speed because the program would use the code
> and data caches more efficiently, and would page in data and code from
> disk less often.
>
> Having fewer physically resident pages also saves on precious kernel
> memory.  I read in O'Reilly's "Understanding the Linux Kernel" that on
> the i386 architecture, the kernel's page tables consume most of the
> physical memory in the computer, leaving very little physical memory
> for user processes!
>
> A first cut would be to start with the runtime program startup code,
> which for C program then calls main().  The subroutines that main
> calls would be placed next in the file.  Suppose main calls Foo() and
> then Bar().  One would then place each of the subroutines that Foo()
> calls all together, then each of the subroutines that Bar() calls.
>
> It would be best if some static analysis were performed to determine
> in what order subroutines are called, and in what order .data and .bss
> memory is accessed.
>
> Getting that analysis right for the general case would not be easy, as
> the time-order in which subroutines are called may of course depend on
> the input data.  To improve the locality, one could produce an
> instrumented executable which saved a stack trace at the entry of each
> subroutine.  Examination of all the stack traces would enable a
> post-processing tool to generate a linker script that would be used
> for a second pass of the linker.  This is a form of profiler-guided
> optimization.
>
> For extra credit one could prepare multiple input files (or for
> interactive programs, several distinctly different GUI robot scripts).
>  Then the tool that prepared the linker script would try to optimize
> for the average case for most code.
>
> Regards,
>
> Don Quixote
> --
> Don Quixote de la Mancha
> Dulcinea Technologies Corporation
> Software of Elegance and Beauty
> http://www.dulcineatech.com
> quixote at dulcineatech.com

Nick Kledzik

2011-Nov-07 21:54 UTC

[LLVMdev] Proposal: MCLinker - an LLVM integrated linker

On Nov 2, 2011, at 11:05 PM, Don Quixote de la Mancha wrote:
> A helpful link-time optimization would be to place subroutines that
> are used close together in time also close together in the executable
> file.  That also goes for data that is in the executable file, whether
> initialized (.data segment) or zero-initialized (.bss).
> 
> If the unit of linkage of code is the function rather than the
> compilation module, and the unit of linkage of data is the individual
> data item rather than all the .data and .bss items together that are
> in a compilation unit, you could rearrange them at will.
> 
> For architectures such as ARM that cannot make jumps to faraway
> addresses, you could make the destinations of subroutine calls close
> to the caller so you would not need so many trampolines.
> 
> The locality improves the speed because the program would use the code
> and data caches more efficiently, and would page in data and code from
> disk less often.
> 
> Having fewer physically resident pages also saves on precious kernel
> memory.  I read in O'Reilly's "Understanding the Linux Kernel" that on
> the i386 architecture, the kernel's page tables consume most of the
> physical memory in the computer, leaving very little physical memory
> for user processes!
> 
> A first cut would be to start with the runtime program startup code,
> which for C program then calls main().  The subroutines that main
> calls would be placed next in the file.  Suppose main calls Foo() and
> then Bar().  One would then place each of the subroutines that Foo()
> calls all together, then each of the subroutines that Bar() calls.

This static analysis does not capture virtual calls (either C++ or
Objective-C).  It may also cause error-handling code to be moved into
the "hot" area.
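
The quoted "first cut" amounts to a breadth-first walk of a static call graph starting at the entry point. Below is a minimal sketch under that reading; the function name, the graph representation (a dict of direct callees) and the symbol names are all invented for illustration. As noted above, a graph built this way will miss virtual and other indirect calls.

```python
from collections import deque


def first_cut_layout(call_graph, entry):
    """Order functions breadth-first from `entry` over direct calls only."""
    order = []
    seen = {entry}
    queue = deque([entry])
    while queue:
        func = queue.popleft()
        order.append(func)
        # Callees of one caller are enqueued together, so they end up adjacent.
        for callee in call_graph.get(func, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return order


# Hypothetical call graph: startup code calls main, main calls Foo then Bar.
example_graph = {
    "_start": ["main"],
    "main": ["Foo", "Bar"],
    "Foo": ["a", "b"],
    "Bar": ["c", "a"],
}
example_order = first_cut_layout(example_graph, "_start")
# example_order == ["_start", "main", "Foo", "Bar", "a", "b", "c"]
```

Functions unreachable through direct calls never appear in the result, so a real tool would still need a fallback placement for them.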

> 
> It would be best if some static analysis were performed to determine
> in what order subroutines are called, and in what order .data and .bss
> memory is accessed.
> 
> Getting that analysis right for the general case would not be easy, as
> the time-order in which subroutines are called may of course depend on
> the input data.  To improve the locality, one could produce an
> instrumented executable which saved a stack trace at the entry of each
> subroutine.  Examination of all the stack traces would enable a
> post-processing tool to generate a linker script that would be used
> for a second pass of the linker.  This is a form of profiler-guided
> optimization.

At Apple we generate "order files" by running a program under dtrace.
We use a feature of dtrace that sets a "one shot" breakpoint on the
start of every function.  You then run the program (under dtrace) and
you get a list of functions in the order they were first called, with
no need to build a special version of the program.
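
As a rough sketch of the post-processing step, the following turns a stream of function-entry events into a first-call ordering. It is not Apple's tooling: the event list stands in for whatever the tracer emits, the function names are made up, and writing one symbol per line is an assumed order-file layout. With true one-shot probes the trace would already contain each function once; the deduplication here covers a generic trace with repeats.

```python
def first_call_order(entry_events):
    """Return function names in the order each was first entered."""
    seen = set()
    order = []
    for name in entry_events:
        if name not in seen:
            seen.add(name)
            order.append(name)
    return order


def format_order_file(order):
    """Assumed layout: one symbol name per line."""
    return "\n".join(order) + "\n"


# Hypothetical trace of function entries during program launch.
trace = ["_main", "_init_prefs", "_load_font", "_init_prefs", "_draw", "_load_font"]
launch_order = first_call_order(trace)
# launch_order == ["_main", "_init_prefs", "_load_font", "_draw"]
```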

Given that the optimal ordering depends on what the user does to
exercise different parts of the program, we've concluded that the minimal
useful ordering is just to get the initialization functions ordered.  This
also helps programs launch faster (less paging).

> 
> For extra credit one could prepare multiple input files (or for
> interactive programs, several distinctly different GUI robot scripts).
> Then the tool that prepared the linker script would try to optimize
> for the average case for most code.
>
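
One possible reading of "optimize for the average case" is to merge the first-call orderings from several runs by mean position. The sketch below does that; the scoring rule (a function missing from a run is treated as if it came last in that run) is an assumption made here, not something proposed in the thread.

```python
def average_case_order(runs):
    """Merge per-run first-call orderings by mean position across runs."""
    # Collect every function once, in first-seen order, for stable tie-breaking.
    all_funcs = []
    seen = set()
    for run in runs:
        for func in run:
            if func not in seen:
                seen.add(func)
                all_funcs.append(func)

    def mean_position(func):
        # Assumption: a function absent from a run counts as position len(run).
        total = sum(run.index(func) if func in run else len(run) for run in runs)
        return total / len(runs)

    return sorted(all_funcs, key=mean_position)


# Two hypothetical runs driven by different inputs.
merged = average_case_order([
    ["init", "load", "draw"],
    ["init", "draw", "save"],
])
# merged == ["init", "draw", "load", "save"]
```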
