I'm toying with benchmarks on my HLVM and am unable to get any performance improvement from optimization passes. Moreover, some of my programs generate a lot of redundant code (e.g. alloca a struct, store a struct into it and read only one field without using the rest of the struct) and this does not appear to be optimized away. I simply copied the use of PassManager from the Kaleidoscope tutorial: let pm = PassManager.create_function mp in TargetData.add (ExecutionEngine.target_data ee) pm; add_constant_propagation pm; (* Do simple "peephole" optimizations and bit-twiddling optzn. *) add_instruction_combining pm; (* reassociate expressions. *) add_reassociation pm; (* Eliminate Common SubExpressions. *) add_gvn pm; (* Simplify the control flow graph (deleting unreachable blocks, etc). *) add_cfg_simplification pm; add_memory_to_register_promotion pm; and then I apply "PassManager.run_function" to every function after it is validated. Any idea what I might be doing wrong? Has anyone else got this functionality giving performance boosts from OCaml? -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e
Gordon Henriksen
2009-Feb-19 19:32 UTC
[LLVMdev] Improving performance with optimization passes
Hi Jon, On 2009-02-19, at 14:00, Jon Harrop wrote:> I'm toying with benchmarks on my HLVM and am unable to get any > performance improvement from optimization passes. I simply copied > the use of PassManager from the Kaleidoscope tutorial: > > Any idea what I might be doing wrong? Has anyone else got this > functionality giving performance boosts from OCaml?That's a pretty barren optimization pipeline. http://llvm.org/docs/Passes.html See opt --help and look into what -std-compile-opts is. That's the usual starting point for new front ends, although it's a whole-module pass pipeline. But it's only a starting point, since it's tuned for llvm-gcc's codegen and yours will probably differ.> Moreover, some of my programs generate a lot of redundant code (e.g. > alloca a struct, store a struct into it and read only one field > without using the rest of the struct) and this does not appear to be > optimized away.I think first-class aggregates are mostly used for passing arguments in llvm-gcc and clang; maybe mem2reg can't see unravel the loads. Have you tried emitting loads and stores of the scalar elements to see if mem2reg can eliminate the allocas then? — Gordon P.S. This is not a trivial problem domain. Here's an interesting paper on the subject. COLE: Compiler Optimization Level Exploration http://users.elis.ugent.be/~leeckhou/papers/cgo08.pdf
On Thursday 19 February 2009 19:00:14 Jon Harrop wrote:> I'm toying with benchmarks on my HLVM and am unable to get any performance > improvement from optimization passes...I just disassembled some of the IR before and after optimization. This example function squares a complex number: let zsqr(r, i) = (r*r - i*i, 2*r*i) My compiler is generating: define fastcc i32 @zsqr({ double, double }*, { double, double }) { entry: %2 = alloca { double, double } ; <{ double, double }*> [#uses=2] %3 = getelementptr { double, double }* %2, i32 0 ; <{ double, double }*> [#uses=1] store { double, double } %1, { double, double }* %3 %4 = getelementptr { double, double }* %2, i32 0, i32 0 ; <double*> [#uses=1] %5 = load double* %4 ; <double> [#uses=1] %6 = alloca { double, double } ; <{ double, double }*> [#uses=2] %7 = getelementptr { double, double }* %6, i32 0 ; <{ double, double }*> [#uses=1] store { double, double } %1, { double, double }* %7 %8 = getelementptr { double, double }* %6, i32 0, i32 0 ; <double*> [#uses=1] %9 = load double* %8 ; <double> [#uses=1] %10 = mul double %5, %9 ; <double> [#uses=1] %11 = alloca { double, double } ; <{ double, double }*> [#uses=2] %12 = getelementptr { double, double }* %11, i32 0 ; <{ double, double }*> [#uses=1] store { double, double } %1, { double, double }* %12 %13 = getelementptr { double, double }* %11, i32 0, i32 1 ; <double*> [#uses=1] %14 = load double* %13 ; <double> [#uses=1] %15 = alloca { double, double } ; <{ double, double }*> [#uses=2] %16 = getelementptr { double, double }* %15, i32 0 ; <{ double, double }*> [#uses=1] store { double, double } %1, { double, double }* %16 %17 = getelementptr { double, double }* %15, i32 0, i32 1 ; <double*> [#uses=1] %18 = load double* %17 ; <double> [#uses=1] %19 = mul double %14, %18 ; <double> [#uses=1] %20 = sub double %10, %19 ; <double> [#uses=1] %21 = alloca { double, double } ; <{ double, double }*> [#uses=2] %22 = getelementptr { double, double }* %21, i32 0 ; <{ double, double }*> [#uses=1] store { double, double } %1, { double, double }* %22 %23 = getelementptr { double, double }* %21, i32 0, i32 0 ; <double*> [#uses=1] %24 = load double* %23 ; <double> [#uses=1] %25 = mul double 2.000000e+00, %24 ; <double> [#uses=1] %26 = alloca { double, double } ; <{ double, double }*> [#uses=2] %27 = getelementptr { double, double }* %26, i32 0 ; <{ double, double }*> [#uses=1] store { double, double } %1, { double, double }* %27 %28 = getelementptr { double, double }* %26, i32 0, i32 1 ; <double*> [#uses=1] %29 = load double* %28 ; <double> [#uses=1] %30 = mul double %25, %29 ; <double> [#uses=1] %31 = alloca { double, double } ; <{ double, double }*> [#uses=3] %32 = getelementptr { double, double }* %31, i32 0, i32 0 ; <double*> [#uses=1] store double %20, double* %32 %33 = getelementptr { double, double }* %31, i32 0, i32 1 ; <double*> [#uses=1] store double %30, double* %33 %34 = getelementptr { double, double }* %31, i32 0 ; <{ double, double }*> [#uses=1] %35 = load { double, double }* %34 ; <{ double, double }> [#uses=1] %36 = getelementptr { double, double }* %0, i32 0 ; <{ double, double }*> [#uses=1] store { double, double } %35, { double, double }* %36 ret i32 0 } But those LLVM optimization passes only reduce it to: define fastcc i32 @zsqr({ double, double }*, { double, double }) { entry: %2 = alloca { double, double } ; <{ double, double }*> [#uses=2] store { double, double } %1, { double, double }* %2, align 8 %3 = getelementptr { double, double }* %2, i32 0, i32 0 ; <double*> [#uses=1] %4 = load double* %3, align 8 ; <double> [#uses=1] %5 = alloca { double, double } ; <{ double, double }*> [#uses=2] store { double, double } %1, { double, double }* %5, align 8 %6 = getelementptr { double, double }* %5, i32 0, i32 0 ; <double*> [#uses=1] %7 = load double* %6, align 8 ; <double> [#uses=1] %8 = mul double %4, %7 ; <double> [#uses=1] %9 = alloca { double, double } ; <{ double, double }*> [#uses=2] store { double, double } %1, { double, double }* %9, align 8 %10 = getelementptr { double, double }* %9, i32 0, i32 1 ; <double*> [#uses=1] %11 = load double* %10, align 8 ; <double> [#uses=1] %12 = alloca { double, double } ; <{ double, double }*> [#uses=2] store { double, double } %1, { double, double }* %12, align 8 %13 = getelementptr { double, double }* %12, i32 0, i32 1 ; <double*> [#uses=1] %14 = load double* %13, align 8 ; <double> [#uses=1] %15 = mul double %11, %14 ; <double> [#uses=1] %16 = sub double %8, %15 ; <double> [#uses=1] %17 = alloca { double, double } ; <{ double, double }*> [#uses=2] store { double, double } %1, { double, double }* %17, align 8 %18 = getelementptr { double, double }* %17, i32 0, i32 0 ; <double*> [#uses=1] %19 = load double* %18, align 8 ; <double> [#uses=1] %20 = mul double %19, 2.000000e+00 ; <double> [#uses=1] %21 = alloca { double, double } ; <{ double, double }*> [#uses=2] store { double, double } %1, { double, double }* %21, align 8 %22 = getelementptr { double, double }* %21, i32 0, i32 1 ; <double*> [#uses=1] %23 = load double* %22, align 8 ; <double> [#uses=1] %24 = mul double %20, %23 ; <double> [#uses=1] %25 = alloca { double, double } ; <{ double, double }*> [#uses=3] %26 = getelementptr { double, double }* %25, i32 0, i32 0 ; <double*> [#uses=1] store double %16, double* %26, align 8 %27 = getelementptr { double, double }* %25, i32 0, i32 1 ; <double*> [#uses=1] store double %24, double* %27, align 8 %28 = load { double, double }* %25, align 8 ; <{ double, double }> [#uses=1] store { double, double } %28, { double, double }* %0 ret i32 0 } So the optimization passes are at least doing something but they are a long way from generating optimal code. Does LLVM have any optimization passes that would promote these structs out of the stack and replace the loads with extractvalue instructions? The ideal result is probably: define fastcc i32 @zsqr({ double, double }*, { double, double }) { entry: %1 = extractvalue {double, double} %1, 0 %2 = extractvalue {double, double} %1, 1 %3 = mul double %1, %1 %4 = mul double %2, %2 %5 = sub double %3, %4 %6 = getelementptr { double, double }* %0, i32 0, i32 0 store double %5, double* %6, align 8 %7 = mul double %1, 2.0 %8 = mul double %7, %2 %9 = getelementptr { double, double }* %0, i32 0, i32 1 store double %8, double* %9, align 8 ret i32 0 } -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e
To add to what Gordon said, the SROA pass, aka -scalarrepl, aka scalar replacement of aggregates, is the main pass which splits struct allocas into fields that the rest of the optimizer can work with. Dan On Feb 19, 2009, at 11:00 AM, Jon Harrop wrote:> > I'm toying with benchmarks on my HLVM and am unable to get any > performance > improvement from optimization passes. Moreover, some of my programs > generate > a lot of redundant code (e.g. alloca a struct, store a struct into > it and > read only one field without using the rest of the struct) and this > does not > appear to be optimized away. > > I simply copied the use of PassManager from the Kaleidoscope tutorial: > > let pm = PassManager.create_function mp in > TargetData.add (ExecutionEngine.target_data ee) pm; > > add_constant_propagation pm; > > (* Do simple "peephole" optimizations and bit-twiddling optzn. *) > add_instruction_combining pm; > > (* reassociate expressions. *) > add_reassociation pm; > > (* Eliminate Common SubExpressions. *) > add_gvn pm; > > (* Simplify the control flow graph (deleting unreachable blocks, > etc). *) > add_cfg_simplification pm; > > add_memory_to_register_promotion pm; > > and then I apply "PassManager.run_function" to every function after > it is > validated. > > Any idea what I might be doing wrong? Has anyone else got this > functionality > giving performance boosts from OCaml? > > -- > Dr Jon Harrop, Flying Frog Consultancy Ltd. > http://www.ffconsultancy.com/?e > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
On Thursday 19 February 2009 19:32:14 Gordon Henriksen wrote:> Hi Jon, > > On 2009-02-19, at 14:00, Jon Harrop wrote: > > I'm toying with benchmarks on my HLVM and am unable to get any > > performance improvement from optimization passes. I simply copied > > the use of PassManager from the Kaleidoscope tutorial: > > > > Any idea what I might be doing wrong? Has anyone else got this > > functionality giving performance boosts from OCaml? > > That's a pretty barren optimization pipeline.Right but am I correct in believing that what I have done is pretty much all that you can do from OCaml right now?> http://llvm.org/docs/Passes.html > > See opt --help and look into what -std-compile-opts is. That's the > usual starting point for new front ends, although it's a whole-module > pass pipeline. But it's only a starting point, since it's tuned for > llvm-gcc's codegen and yours will probably differ.Thanks.> > Moreover, some of my programs generate a lot of redundant code (e.g. > > alloca a struct, store a struct into it and read only one field > > without using the rest of the struct) and this does not appear to be > > optimized away. > > I think first-class aggregates are mostly used for passing arguments > in llvm-gcc and clang; maybe mem2reg can't see unravel the loads. Have > you tried emitting loads and stores of the scalar elements to see if > mem2reg can eliminate the allocas then?If I could do that I wouldn't be in this mess! ;-) I am only generating these temporary structs on the stack because I cannot load and store struct elements any other way because the OCaml bindings do not yet have insertvalue and extractvalue.> P.S. This is not a trivial problem domain. Here's an interesting paper > on the subject. > > COLE: Compiler Optimization Level Exploration > http://users.elis.ugent.be/~leeckhou/papers/cgo08.pdfI'll check it out, thanks. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e
Anton Korobeynikov
2009-Feb-19 20:02 UTC
[LLVMdev] Improving performance with optimization passes
> > On Thursday 19 February 2009 19:00:14 Jon Harrop wrote: >> I'm toying with benchmarks on my HLVM and am unable to get any >> performance >> improvement from optimization passes... > > I just disassembled some of the IR before and after optimization. > This example > function squares a complex number:Something is definitely wrong with the way you're using optimization passes.> The ideal result is probably:It is indeed so: ./opt -std-compile-opts test.bc | ./llvm-dis ; ModuleID = '<stdin>' define fastcc i32 @zsqr({ double, double }* nocapture, { double, double }) nounwind { entry: %2 = extractvalue { double, double } %1, 0 ; <double> [#uses=1] %3 = extractvalue { double, double } %1, 0 ; <double> [#uses=1] %4 = mul double %2, %3 ; <double> [#uses=1] %5 = extractvalue { double, double } %1, 1 ; <double> [#uses=1] %6 = extractvalue { double, double } %1, 1 ; <double> [#uses=1] %7 = mul double %5, %6 ; <double> [#uses=1] %8 = sub double %4, %7 ; <double> [#uses=1] %9 = extractvalue { double, double } %1, 0 ; <double> [#uses=1] %10 = mul double %9, 2.000000e+00 ; <double> [#uses=1] %11 = extractvalue { double, double } %1, 1 ; <double> [#uses=1] %12 = mul double %10, %11 ; <double> [#uses=1] %insert = insertvalue { double, double } undef, double %8, 0 ; <{ double, double }> [#uses=1] %insert2 = insertvalue { double, double } %insert, double %12, 1 ; <{ double, double }> [#uses=1] store { double, double } %insert2, { double, double }* %0 ret i32 0 } --- With best regards, Anton Korobeynikov Faculty of Mathematics and Mechanics, Saint Petersburg State University -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20090219/1d985257/attachment.html>
Chris Lattner
2009-Feb-19 22:31 UTC
[LLVMdev] Improving performance with optimization passes
On Feb 19, 2009, at 11:44 AM, Jon Harrop wrote:> On Thursday 19 February 2009 19:00:14 Jon Harrop wrote: >> I'm toying with benchmarks on my HLVM and am unable to get any >> performance >> improvement from optimization passes... > > I just disassembled some of the IR before and after optimization. > This example > function squares a complex number: > > let zsqr(r, i) = (r*r - i*i, 2*r*i) > > My compiler is generating: > > define fastcc i32 @zsqr({ double, double }*, { double, double }) { > entry: > %2 = alloca { double, double } ; <{ double, double }*> [#uses=2]Jon, make sure you always emit allocas into the entry block of the function. Otherwise mem2reg and SROA won't hack on them. -Chris