thr3ads.net - llvm dev - [LLVMdev] Cross-module function inlining [Jan 2010]

Mark Muir

2010-Jan-13 16:38 UTC

[LLVMdev] Cross-module function inlining

I've developed a working LLVM back-end (based on LLVM 2.6) for a custom
architecture with its own tool chain. This tool chain creates stand-alone
programs from a single assembly. We used to use GCC, which supported producing a
single machine assembly from multiple source files.

I modified Clang to accept the architecture, but discovered that clang-cc (or
the Clang Tool subclass inside Clang) doesn't allow multiple source files to
be lowered to a single machine assembly. The ToolChain subclasses inside Clang
make use of the normal system linker to combine multiple modules, but this
isn't possible on our system.

So, I created a new Clang ToolChain subclass that forms a tool pipeline based on
the following:
- Run the existing Clang tool on each source file, using -emit-llvm to generate
a .bc file for each module.
- Run llvm-link to merge them into a single .bc file.
- Run llc to generate a complete machine assembly.
The last two were implemented together in a single Tool, performing the job of
the linker. Optimisation options are passed onto each tool.
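
As a rough sketch, the three stages listed above could be reproduced from a shell as follows. The tool names and -emit-llvm come from the description above; the `run` and `build_single_asm` helpers, the file names, and the use of -c and -o are assumptions made for illustration. `run` only prints each command, so the sequence can be inspected without an LLVM installation.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): print the command instead of executing it.
# Replace the body with "$@" to really invoke the tools.
run() { echo "$*"; }

# build_single_asm OUT SRC...   (hypothetical helper)
# Compiles each C source to bitcode, merges the bitcode, then lowers the
# merged module to one assembly file named OUT.s.
build_single_asm() {
  out=$1; shift
  bitcode=""
  for src in "$@"; do
    bc="${src%.c}.bc"
    # Stage 1: one .bc file per source file.
    run clang -O3 -emit-llvm -c "$src" -o "$bc"
    bitcode="$bitcode $bc"
  done
  # Stage 2: merge all modules (word splitting of $bitcode is intentional).
  run llvm-link $bitcode -o "$out.bc"
  # Stage 3: generate a single machine assembly.
  run llc "$out.bc" -o "$out.s"
}

build_single_asm prog main.c util.c
```

Note that nothing in this sequence optimises the merged module as a whole, which matches the behaviour described next.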

This does the trick.

However, with optimisations enabled, the resulting code is not as efficient as
it would be if all the code were in a single module. In particular, function
inlining is only performed by clang (i.e. only on a module-by-module basis), and
not by llvm-link or llc. This can be seen in the resulting pass options with -O3
(obtained using '-Xclang -debug-only=Execution' and '-Xlinker
-debug-only=Execution'):

Clang:
Pass Arguments:  -raiseallocs -simplifycfg -domtree -domfrontier -mem2reg
-globalopt -globaldce -ipconstprop -deadargelim -instcombine -simplifycfg
-basiccg -prune-eh -functionattrs -inline -argpromotion -simplify-libcalls
-instcombine -jump-threading -simplifycfg -domtree -domfrontier -scalarrepl
-instcombine -break-crit-edges -condprop -tailcallelim -simplifycfg -reassociate
-domtree -loops -loopsimplify -domfrontier -lcssa -loop-rotate -licm -lcssa
-loop-unswitch -instcombine -scalar-evolution -lcssa -iv-users -indvars
-loop-deletion -lcssa -loop-unroll -instcombine -memdep -gvn -memdep -memcpyopt
-sccp -instcombine -break-crit-edges -condprop -domtree -memdep -dse -adce
-simplifycfg -strip-dead-prototypes -print-used-types -deadtypeelim -constmerge

llc:
Pass Arguments:  -preverify -domtree -verify -loops -loopsimplify
-scalar-evolution -iv-users -loop-reduce -lowerinvoke -unreachableblockelim
-codegenprepare -stack-protector -machine-function-analysis -machinedomtree
-machine-loops -machinelicm -machine-sink -unreachable-mbb-elimination -livevars
-phi-node-elimination -twoaddressinstruction -liveintervals
-simple-register-coalescing -livestacks -virtregmap -linearscan-regalloc
-stack-slot-coloring -prologepilog -machinedomtree -machine-loops -machine-loops

I'm sure I can hack away to manually add these passes, but I'd prefer an
informed opinion on the best way to achieve this, or if there's a more
proper way to achieve the same thing (i.e. inter-module function inlining).

Also, I've noticed another problem with this approach: when function
declarations are 'inline __attribute__((always_inline))' in header
files, where the corresponding function definition is in a separate module to
where the function is being called, LLVM will not inline the function call at
the call site, but will happily strip away the function body, resulting in
broken code. Is there a way to stop this?

Any guidance is much appreciated.

Regards,

- Mark

Nick Lewycky

2010-Jan-13 16:43 UTC

Mark Muir wrote:
> I've developed a working LLVM back-end (based on LLVM 2.6) for a custom
> architecture with its own tool chain. This tool chain creates stand-alone
> programs from a single assembly. We used to use GCC, which supported
> producing a single machine assembly from multiple source files.
>
> I modified Clang to accept the architecture, but discovered that clang-cc
> (or the Clang Tool subclass inside Clang) doesn't allow multiple source
> files to be lowered to a single machine assembly. The ToolChain subclasses
> inside Clang make use of the normal system linker to combine multiple
> modules, but this isn't possible on our system.
>
> So, I created a new Clang ToolChain subclass that forms a tool pipeline
> based on the following:
> - Run the existing Clang tool on each source file, using -emit-llvm to
>   generate a .bc file for each module.
> - Run llvm-link to merge them into a single .bc file.
> - Run llc to generate a complete machine assembly.
> The last two were implemented together in a single Tool, performing the job
> of the linker. Optimisation options are passed onto each tool.
>
> This does the trick.
>
> However, with optimisations enabled, the resulting code is not as efficient
> as it would be if all the code were in a single module. In particular,
> function inlining is only performed by clang (i.e. only on a
> module-by-module basis), and not by llvm-link or llc. This can be seen in
> the resulting pass options with -O3 (obtained using
> '-Xclang -debug-only=Execution' and '-Xlinker -debug-only=Execution'):

It sounds like you're not running the LTO optimizations. You could try
replacing llvm-link with llvm-ld which will, or run 'opt -std-link-opts'
between llvm-link and llc.
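
A minimal sketch of the second suggestion, placing 'opt -std-link-opts' between the merge and llc. The `run` and `link_time_opt` helpers and the file names are hypothetical, and -o is assumed to name the output file; `run` only prints the commands rather than executing them.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): print the command instead of executing it.
run() { echo "$*"; }

# link_time_opt MERGED.bc   (hypothetical helper)
# Takes the bitcode file produced by llvm-link, runs the standard
# link-time passes over the whole program, then lowers the result.
link_time_opt() {
  merged=$1
  base="${merged%.bc}"
  # Whole-program optimisation of the merged module.
  run opt -std-link-opts "$merged" -o "$base.opt.bc"
  # Code generation from the optimised module.
  run llc "$base.opt.bc" -o "$base.s"
}

link_time_opt all.bc
```

Because the passes run on the merged module, the inliner can see callers and callees that started out in different source files.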

> Clang:
> Pass Arguments:  -raiseallocs -simplifycfg -domtree -domfrontier -mem2reg
> -globalopt -globaldce -ipconstprop -deadargelim -instcombine -simplifycfg
> -basiccg -prune-eh -functionattrs -inline -argpromotion -simplify-libcalls
> -instcombine -jump-threading -simplifycfg -domtree -domfrontier -scalarrepl
> -instcombine -break-crit-edges -condprop -tailcallelim -simplifycfg -reassociate
> -domtree -loops -loopsimplify -domfrontier -lcssa -loop-rotate -licm -lcssa
> -loop-unswitch -instcombine -scalar-evolution -lcssa -iv-users -indvars
> -loop-deletion -lcssa -loop-unroll -instcombine -memdep -gvn -memdep -memcpyopt
> -sccp -instcombine -break-crit-edges -condprop -domtree -memdep -dse -adce
> -simplifycfg -strip-dead-prototypes -print-used-types -deadtypeelim -constmerge

This pass list is fine, it's equivalent to 'opt -std-compile-opts'.

Nick

Mark Muir

2010-Jan-13 20:05 UTC

On 13 Jan 2010, at 16:43, Nick Lewycky wrote:
> Mark Muir wrote:
>> - Run the existing Clang tool on each source file, using -emit-llvm to
>>   generate a .bc file for each module.
>> - Run llvm-link to merge them into a single .bc file.
>> - Run llc to generate a complete machine assembly.
>>
>> However, with optimisations enabled, the resulting code is not as
>> efficient as it would be if all the code were in a single module. In
>> particular, function inlining is only performed by clang (i.e. only on a
>> module-by-module basis), and not by llvm-link or llc.
>
> It sounds like you're not running the LTO optimizations. You could try
> replacing llvm-link with llvm-ld which will, or run 'opt -std-link-opts'
> between llvm-link and llc.

Yep, that sorted inlining. Thanks.

But... now there's a small problem with library calls. Symbols such as
'memset', 'malloc', etc. are being removed by global dead code
elimination. They are implemented in one of the bitcode modules that are linked
together (the implementations are based on newlib). They are stripped even when
they are live, and I can reproduce this with just the following:

opt -internalize -globaldce

Other (non-standard-library) functions that are implemented in a different
module from the one where they are called are correctly seen as live. So, could
this be something to do with what is declared as a built-in? I haven't provided
any list of built-ins (or overridden the defaults), nor could I figure out how
exactly to do that.
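
One possible explanation, which is an assumption and is not confirmed anywhere in this thread, is that calls to routines such as memset only appear during code generation. In that case -internalize sees no use in the bitcode, marks the definition internal, and -globaldce deletes it. If so, a sketch of a workaround is to name the symbols that must stay external. This assumes your opt build accepts -internalize-public-api-list (check its option list first). The `run` and `internalize_keeping` helpers and the file names are hypothetical, and `run` only prints the command.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): print the command instead of executing it.
run() { echo "$*"; }

# internalize_keeping IN.bc OUT.bc SYMBOL...   (hypothetical helper)
# Internalizes every global except the listed symbols, then removes
# whatever is left unreferenced.
internalize_keeping() {
  in=$1; out=$2; shift 2
  # Join the remaining arguments with commas: "main,memset,malloc".
  keep=$(IFS=,; echo "$*")
  run opt -internalize "-internalize-public-api-list=$keep" -globaldce "$in" -o "$out"
}

# main is listed explicitly, on the assumption that supplying a list
# replaces the default of preserving only main.
internalize_keeping all.bc all.dce.bc main memset malloc
```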

I've also noticed other problems related to built-ins. In one example, code
made use of abs() but didn't #include <stdlib.h>. It compiled without warning
or error, but the generated code was broken because the arguments were not
seen as live, e.g.:

Without #include <stdlib.h>:

	0x181e8b0: i32 = TargetGlobalAddress <i32 (...)* @abs> 0 [TF=1]
=>	JUMP_CALLi <ga:abs>[TF=1], %r2<imp-def>, %r3<imp-def>,
%r4<imp-def,dead>, %r5<imp-def,dead>, %r6<imp-def,dead>,
%r7<imp-def,dead>, %r8<imp-def,dead>, %r9<imp-def,dead>,
%r10<imp-def,dead>

With #include <stdlib.h>:

	0x181e8b0: i32 = TargetGlobalAddress <i32 (i32)* @abs> 0 [TF=1]
=>	JUMP_CALLi <ga:abs>[TF=1], %r3<kill>, %r2<imp-def>,
%r3<imp-def>, %r4<imp-def,dead>, %r5<imp-def,dead>,
%r6<imp-def,dead>, %r7<imp-def,dead>, %r8<imp-def,dead>,
%r9<imp-def,dead>, %r10<imp-def,dead>

Where r2 is the link register, and r3 to r10 are argument/retval registers.
LowerFormalArguments() doesn't see any arguments in the former, and
consequently doesn't add input register nodes to the DAG.

I guess I need help with the concept of built-ins, and what code is related to
them in the Clang driver and back-end.

Regards,

- Mark
