I've developed a working LLVM back-end (based on LLVM 2.6) for a custom architecture with its own tool chain. This tool chain creates stand-alone programs from a single assembly. We used to use GCC, which supported producing a single machine assembly from multiple source files.

I modified Clang to accept the architecture, but discovered that clang-cc (or the Clang Tool subclass inside Clang) doesn't allow multiple source files to be lowered to a single machine assembly. The ToolChain subclasses inside Clang make use of the normal system linker to combine multiple modules, but this isn't possible on our system.

So, I created a new Clang ToolChain subclass that forms a tool pipeline based on the following:
- Run the existing Clang tool on each source file, using -emit-llvm to generate a .bc file for each module.
- Run llvm-link to merge them into a single .bc file.
- Run llc to generate a complete machine assembly.
The last two were implemented together in a single Tool, performing the job of the linker. Optimisation options are passed onto each tool.

This does the trick.

However, with optimisations enabled, the resulting code is not as efficient as it would be if all the code were in a single module. In particular, function inlining is only performed by clang (i.e. only on a module-by-module basis), and not by llvm-link or llc. This can be seen in the resulting pass options with -O3 (obtained using '-Xclang -debug-only=Execution' and '-Xlinker -debug-only=Execution'):

Clang:
Pass Arguments: -raiseallocs -simplifycfg -domtree -domfrontier -mem2reg -globalopt -globaldce -ipconstprop -deadargelim -instcombine -simplifycfg -basiccg -prune-eh -functionattrs -inline -argpromotion -simplify-libcalls -instcombine -jump-threading -simplifycfg -domtree -domfrontier -scalarrepl -instcombine -break-crit-edges -condprop -tailcallelim -simplifycfg -reassociate -domtree -loops -loopsimplify -domfrontier -lcssa -loop-rotate -licm -lcssa -loop-unswitch -instcombine -scalar-evolution -lcssa -iv-users -indvars -loop-deletion -lcssa -loop-unroll -instcombine -memdep -gvn -memdep -memcpyopt -sccp -instcombine -break-crit-edges -condprop -domtree -memdep -dse -adce -simplifycfg -strip-dead-prototypes -print-used-types -deadtypeelim -constmerge

llc:
Pass Arguments: -preverify -domtree -verify -loops -loopsimplify -scalar-evolution -iv-users -loop-reduce -lowerinvoke -unreachableblockelim -codegenprepare -stack-protector -machine-function-analysis -machinedomtree -machine-loops -machinelicm -machine-sink -unreachable-mbb-elimination -livevars -phi-node-elimination -twoaddressinstruction -liveintervals -simple-register-coalescing -livestacks -virtregmap -linearscan-regalloc -stack-slot-coloring -prologepilog -machinedomtree -machine-loops -machine-loops

I'm sure I can hack away to manually add these passes, but I'd prefer an informed opinion on the best way to achieve this, or if there's a more proper way to achieve the same thing (i.e. inter-module function inlining).

Also, I've noticed another problem with this approach: when function declarations are 'inline __attribute__((always_inline))' in header files, where the corresponding function definition is in a separate module to where the function is being called, LLVM will not inline the function call at the call site, but will happily strip away the function body, resulting in broken code. Is there a way to stop this?

Any guidance is much appreciated.

Regards,

- Mark
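
For the always_inline case described above, one common way to sidestep the problem (not something confirmed in this thread) is to put the function body in the header as 'static inline', so that every module calling it also has the definition and nothing depends on a body that lives in another module. A minimal sketch, in which the header name clamp.h and the functions clamp_to_byte() and scale_pixel() are made up for illustration:

```c
/* Contents of a hypothetical header, clamp.h.
 * 'static' gives each including module its own private copy of the body,
 * so a definition is always available at the call site. */
static inline __attribute__((always_inline)) int clamp_to_byte(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return v;
}

/* Contents of a hypothetical caller module that includes clamp.h. */
int scale_pixel(int v)
{
    return clamp_to_byte(v * 2);
}
```

The trade-off is that the function no longer has a single external definition: any module that takes its address gets its own copy.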
Mark Muir wrote:
> I've developed a working LLVM back-end (based on LLVM 2.6) for a custom architecture with its own tool chain. This tool chain creates stand-alone programs from a single assembly. We used to use GCC, which supported producing a single machine assembly from multiple source files.
>
> I modified Clang to accept the architecture, but discovered that clang-cc (or the Clang Tool subclass inside Clang) doesn't allow multiple source files to be lowered to a single machine assembly. The ToolChain subclasses inside Clang make use of the normal system linker to combine multiple modules, but this isn't possible on our system.
>
> So, I created a new Clang ToolChain subclass that forms a tool pipeline based on the following:
> - Run the existing Clang tool on each source file, using -emit-llvm to generate a .bc file for each module.
> - Run llvm-link to merge them into a single .bc file.
> - Run llc to generate a complete machine assembly.
> The last two were implemented together in a single Tool, performing the job of the linker. Optimisation options are passed onto each tool.
>
> This does the trick.
>
> However, with optimisations enabled, the resulting code is not as efficient as it would be if all the code were in a single module. In particular, function inlining is only performed by clang (i.e. only on a module-by-module basis), and not by llvm-link or llc. This can be seen in the resulting pass options with -O3 (obtained using '-Xclang -debug-only=Execution' and '-Xlinker -debug-only=Execution'):

It sounds like you're not running the LTO optimizations. You could try replacing llvm-link with llvm-ld, which will, or run 'opt -std-link-opts' between llvm-link and llc.

> Clang:
> Pass Arguments: -raiseallocs -simplifycfg -domtree -domfrontier -mem2reg -globalopt -globaldce -ipconstprop -deadargelim -instcombine -simplifycfg -basiccg -prune-eh -functionattrs -inline -argpromotion -simplify-libcalls -instcombine -jump-threading -simplifycfg -domtree -domfrontier -scalarrepl -instcombine -break-crit-edges -condprop -tailcallelim -simplifycfg -reassociate -domtree -loops -loopsimplify -domfrontier -lcssa -loop-rotate -licm -lcssa -loop-unswitch -instcombine -scalar-evolution -lcssa -iv-users -indvars -loop-deletion -lcssa -loop-unroll -instcombine -memdep -gvn -memdep -memcpyopt -sccp -instcombine -break-crit-edges -condprop -domtree -memdep -dse -adce -simplifycfg -strip-dead-prototypes -print-used-types -deadtypeelim -constmerge

This pass list is fine; it's equivalent to 'opt -std-compile-opts'.

Nick

> llc:
> Pass Arguments: -preverify -domtree -verify -loops -loopsimplify -scalar-evolution -iv-users -loop-reduce -lowerinvoke -unreachableblockelim -codegenprepare -stack-protector -machine-function-analysis -machinedomtree -machine-loops -machinelicm -machine-sink -unreachable-mbb-elimination -livevars -phi-node-elimination -twoaddressinstruction -liveintervals -simple-register-coalescing -livestacks -virtregmap -linearscan-regalloc -stack-slot-coloring -prologepilog -machinedomtree -machine-loops -machine-loops
>
> I'm sure I can hack away to manually add these passes, but I'd prefer an informed opinion on the best way to achieve this, or if there's a more proper way to achieve the same thing (i.e. inter-module function inlining).
>
> Also, I've noticed another problem with this approach: when function declarations are 'inline __attribute__((always_inline))' in header files, where the corresponding function definition is in a separate module to where the function is being called, LLVM will not inline the function call at the call site, but will happily strip away the function body, resulting in broken code. Is there a way to stop this?
>
> Any guidance is much appreciated.
>
> Regards,
>
> - Mark
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
On 13 Jan 2010, at 16:43, Nick Lewycky wrote:
> Mark Muir wrote:
>> - Run the existing Clang tool on each source file, using -emit-llvm to generate a .bc file for each module.
>> - Run llvm-link to merge them into a single .bc file.
>> - Run llc to generate a complete machine assembly.
>>
>> However, with optimisations enabled, the resulting code is not as efficient as it would be if all the code were in a single module. In particular, function inlining is only performed by clang (i.e. only on a module-by-module basis), and not by llvm-link or llc.
>
> It sounds like you're not running the LTO optimizations. You could try replacing llvm-link with llvm-ld which will, or run 'opt -std-link-opts' between llvm-link and llc.
>

Yep, that sorted inlining. Thanks.

But... now there's a small problem with library calls. Symbols such as 'memset', 'malloc', etc. are being removed by global dead code elimination. They are implemented in one of the bitcode modules that are linked together (implementations are based on newlib).

I get the same behaviour of them being stripped even when they are live, by the following:

opt -internalize -globaldce

Other (not standard-library) functions implemented in different modules than where they are called are correctly seen as live. So, could this be something to do with what is declared as a built-in? I haven't provided any list of built-ins (or overridden the defaults), nor could I figure out how exactly to do that.

I've also noticed other problems related to built-ins. In one example, code made use of abs(), but didn't #include <stdlib.h>. The code compiled without warning or error, but the result was broken, due to the arguments not being seen as live, e.g.:

Without #include <stdlib.h>:
0x181e8b0: i32 = TargetGlobalAddress <i32 (...)* @abs> 0 [TF=1]
=> JUMP_CALLi <ga:abs>[TF=1], %r2<imp-def>, %r3<imp-def>, %r4<imp-def,dead>, %r5<imp-def,dead>, %r6<imp-def,dead>, %r7<imp-def,dead>, %r8<imp-def,dead>, %r9<imp-def,dead>, %r10<imp-def,dead>

With #include <stdlib.h>:
0x181e8b0: i32 = TargetGlobalAddress <i32 (i32)* @abs> 0 [TF=1]
=> JUMP_CALLi <ga:abs>[TF=1], %r3<kill>, %r2<imp-def>, %r3<imp-def>, %r4<imp-def,dead>, %r5<imp-def,dead>, %r6<imp-def,dead>, %r7<imp-def,dead>, %r8<imp-def,dead>, %r9<imp-def,dead>, %r10<imp-def,dead>

Where r2 is the link register, and r3 to r10 are argument/retval registers. LowerFormalArguments() doesn't see any arguments in the former, and consequently doesn't add input register nodes to the DAG.

I guess I need help with the concept of built-ins, and what code is related to them in the Clang driver and back-end.

Regards,

- Mark
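
The two DAG dumps above differ in the type of @abs: 'i32 (...)*' without the header and 'i32 (i32)*' with it. That is consistent with C's old implicit-declaration rule, where a call to an undeclared function is treated as a call to 'int abs()' with unspecified parameters. A minimal sketch of the declared form, assuming a hypothetical helper named distance(). Building with -Werror=implicit-function-declaration (an option in GCC and current Clang, not checked against the Clang that shipped with LLVM 2.6) should turn a missing include into a hard error instead of silently broken code:

```c
#include <stdlib.h> /* provides the prototype: int abs(int) */

/* Hypothetical helper. Because abs() is declared with a prototype,
 * the call below has type int (int), not int (). */
int distance(int a, int b)
{
    return abs(a - b);
}
```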