thr3ads.net - llvm dev - [LLVMdev] Improving performance with optimization passes [Feb 2009]

If this information is useful, please help other people find it:
Share via:

Jon Harrop

2009-Feb-19 19:00 UTC

[LLVMdev] Improving performance with optimization passes

I'm toying with benchmarks on my HLVM and am unable to get any performance 
improvement from optimization passes. Moreover, some of my programs generate 
a lot of redundant code (e.g. alloca a struct, store a struct into it and 
read only one field without using the rest of the struct) and this does not 
appear to be optimized away.

I simply copied the use of PassManager from the Kaleidoscope tutorial:

  let pm = PassManager.create_function mp in
  TargetData.add (ExecutionEngine.target_data ee) pm;

  add_constant_propagation pm;

  (* Do simple "peephole" optimizations and bit-twiddling optzn. *)
  add_instruction_combining pm;

  (* reassociate expressions. *)
  add_reassociation pm;

  (* Eliminate Common SubExpressions. *)
  add_gvn pm;

  (* Simplify the control flow graph (deleting unreachable blocks, etc). *)
  add_cfg_simplification pm;

  add_memory_to_register_promotion pm;

and then I apply "PassManager.run_function" to every function after it
is
validated.

Any idea what I might be doing wrong? Has anyone else got this functionality 
giving performance boosts from OCaml?

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

Gordon Henriksen

2009-Feb-19 19:32 UTC

head link

[LLVMdev] Improving performance with optimization passes

Hi Jon,

On 2009-02-19, at 14:00, Jon Harrop wrote:
> I'm toying with benchmarks on my HLVM and am unable to get any  
> performance improvement from optimization passes. I simply copied  
> the use of PassManager from the Kaleidoscope tutorial:
>
> Any idea what I might be doing wrong? Has anyone else got this  
> functionality giving performance boosts from OCaml?
That's a pretty barren optimization pipeline.

http://llvm.org/docs/Passes.html

See opt --help and look into what -std-compile-opts is. That's the  
usual starting point for new front ends, although it's a whole-module  
pass pipeline. But it's only a starting point, since it's tuned for  
llvm-gcc's codegen and yours will probably differ.
> Moreover, some of my programs generate a lot of redundant code (e.g.  
> alloca a struct, store a struct into it and read only one field  
> without using the rest of the struct) and this does not appear to be  
> optimized away.
I think first-class aggregates are mostly used for passing arguments  
in llvm-gcc and clang; maybe mem2reg can't see unravel the loads. Have  
you tried emitting loads and stores of the scalar elements to see if  
mem2reg can eliminate the allocas then?

— Gordon

P.S. This is not a trivial problem domain. Here's an interesting paper  
on the subject.

COLE: Compiler Optimization Level Exploration
http://users.elis.ugent.be/~leeckhou/papers/cgo08.pdf

Jon Harrop

2009-Feb-19 19:44 UTC

head link

[LLVMdev] Improving performance with optimization passes

On Thursday 19 February 2009 19:00:14 Jon Harrop wrote:> I'm toying with benchmarks on my HLVM and am unable to get any
performance
> improvement from optimization passes...
I just disassembled some of the IR before and after optimization. This example 
function squares a complex number:

  let zsqr(r, i) = (r*r - i*i, 2*r*i)

My compiler is generating:

define fastcc i32 @zsqr({ double, double }*, { double, double }) {
entry:
	%2 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	%3 = getelementptr { double, double }* %2, i32 0		; <{ double, double }*>
[#uses=1]
	store { double, double } %1, { double, double }* %3
	%4 = getelementptr { double, double }* %2, i32 0, i32 0		; <double*> 
[#uses=1]
	%5 = load double* %4		; <double> [#uses=1]
	%6 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	%7 = getelementptr { double, double }* %6, i32 0		; <{ double, double }*>
[#uses=1]
	store { double, double } %1, { double, double }* %7
	%8 = getelementptr { double, double }* %6, i32 0, i32 0		; <double*> 
[#uses=1]
	%9 = load double* %8		; <double> [#uses=1]
	%10 = mul double %5, %9		; <double> [#uses=1]
	%11 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	%12 = getelementptr { double, double }* %11, i32 0		; <{ double, double
}*>
[#uses=1]
	store { double, double } %1, { double, double }* %12
	%13 = getelementptr { double, double }* %11, i32 0, i32 1		; <double*> 
[#uses=1]
	%14 = load double* %13		; <double> [#uses=1]
	%15 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	%16 = getelementptr { double, double }* %15, i32 0		; <{ double, double
}*>
[#uses=1]
	store { double, double } %1, { double, double }* %16
	%17 = getelementptr { double, double }* %15, i32 0, i32 1		; <double*> 
[#uses=1]
	%18 = load double* %17		; <double> [#uses=1]
	%19 = mul double %14, %18		; <double> [#uses=1]
	%20 = sub double %10, %19		; <double> [#uses=1]
	%21 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	%22 = getelementptr { double, double }* %21, i32 0		; <{ double, double
}*>
[#uses=1]
	store { double, double } %1, { double, double }* %22
	%23 = getelementptr { double, double }* %21, i32 0, i32 0		; <double*> 
[#uses=1]
	%24 = load double* %23		; <double> [#uses=1]
	%25 = mul double 2.000000e+00, %24		; <double> [#uses=1]
	%26 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	%27 = getelementptr { double, double }* %26, i32 0		; <{ double, double
}*>
[#uses=1]
	store { double, double } %1, { double, double }* %27
	%28 = getelementptr { double, double }* %26, i32 0, i32 1		; <double*> 
[#uses=1]
	%29 = load double* %28		; <double> [#uses=1]
	%30 = mul double %25, %29		; <double> [#uses=1]
	%31 = alloca { double, double }		; <{ double, double }*> [#uses=3]
	%32 = getelementptr { double, double }* %31, i32 0, i32 0		; <double*> 
[#uses=1]
	store double %20, double* %32
	%33 = getelementptr { double, double }* %31, i32 0, i32 1		; <double*> 
[#uses=1]
	store double %30, double* %33
	%34 = getelementptr { double, double }* %31, i32 0		; <{ double, double
}*>
[#uses=1]
	%35 = load { double, double }* %34		; <{ double, double }> [#uses=1]
	%36 = getelementptr { double, double }* %0, i32 0		; <{ double, double
}*>
[#uses=1]
	store { double, double } %35, { double, double }* %36
	ret i32 0
}

But those LLVM optimization passes only reduce it to:

define fastcc i32 @zsqr({ double, double }*, { double, double }) {
entry:
	%2 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	store { double, double } %1, { double, double }* %2, align 8
	%3 = getelementptr { double, double }* %2, i32 0, i32 0		; <double*> 
[#uses=1]
	%4 = load double* %3, align 8		; <double> [#uses=1]
	%5 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	store { double, double } %1, { double, double }* %5, align 8
	%6 = getelementptr { double, double }* %5, i32 0, i32 0		; <double*> 
[#uses=1]
	%7 = load double* %6, align 8		; <double> [#uses=1]
	%8 = mul double %4, %7		; <double> [#uses=1]
	%9 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	store { double, double } %1, { double, double }* %9, align 8
	%10 = getelementptr { double, double }* %9, i32 0, i32 1		; <double*> 
[#uses=1]
	%11 = load double* %10, align 8		; <double> [#uses=1]
	%12 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	store { double, double } %1, { double, double }* %12, align 8
	%13 = getelementptr { double, double }* %12, i32 0, i32 1		; <double*> 
[#uses=1]
	%14 = load double* %13, align 8		; <double> [#uses=1]
	%15 = mul double %11, %14		; <double> [#uses=1]
	%16 = sub double %8, %15		; <double> [#uses=1]
	%17 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	store { double, double } %1, { double, double }* %17, align 8
	%18 = getelementptr { double, double }* %17, i32 0, i32 0		; <double*> 
[#uses=1]
	%19 = load double* %18, align 8		; <double> [#uses=1]
	%20 = mul double %19, 2.000000e+00		; <double> [#uses=1]
	%21 = alloca { double, double }		; <{ double, double }*> [#uses=2]
	store { double, double } %1, { double, double }* %21, align 8
	%22 = getelementptr { double, double }* %21, i32 0, i32 1		; <double*> 
[#uses=1]
	%23 = load double* %22, align 8		; <double> [#uses=1]
	%24 = mul double %20, %23		; <double> [#uses=1]
	%25 = alloca { double, double }		; <{ double, double }*> [#uses=3]
	%26 = getelementptr { double, double }* %25, i32 0, i32 0		; <double*> 
[#uses=1]
	store double %16, double* %26, align 8
	%27 = getelementptr { double, double }* %25, i32 0, i32 1		; <double*> 
[#uses=1]
	store double %24, double* %27, align 8
	%28 = load { double, double }* %25, align 8		; <{ double, double }>
[#uses=1]
	store { double, double } %28, { double, double }* %0
	ret i32 0
}

So the optimization passes are at least doing something but they are a long 
way from generating optimal code. Does LLVM have any optimization passes that 
would promote these structs out of the stack and replace the loads with 
extractvalue instructions?

The ideal result is probably:

define fastcc i32 @zsqr({ double, double }*, { double, double }) {
entry:
        %1 = extractvalue {double, double} %1, 0
        %2 = extractvalue {double, double} %1, 1
        %3 = mul double %1, %1
        %4 = mul double %2, %2
        %5 = sub double %3, %4
	%6 = getelementptr { double, double }* %0, i32 0, i32 0
	store double %5, double* %6, align 8
        %7 = mul double %1, 2.0
        %8 = mul double %7, %2
	%9 = getelementptr { double, double }* %0, i32 0, i32 1
	store double %8, double* %9, align 8
	ret i32 0
}

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

Dan Gohman

2009-Feb-19 19:44 UTC

head link

[LLVMdev] Improving performance with optimization passes

To add to what Gordon said, the SROA pass, aka -scalarrepl, aka
scalar replacement of aggregates, is the main pass which splits
struct allocas into fields that the rest of the optimizer can
work with.

Dan

On Feb 19, 2009, at 11:00 AM, Jon Harrop wrote:
>
> I'm toying with benchmarks on my HLVM and am unable to get any  
> performance
> improvement from optimization passes. Moreover, some of my programs  
> generate
> a lot of redundant code (e.g. alloca a struct, store a struct into  
> it and
> read only one field without using the rest of the struct) and this  
> does not
> appear to be optimized away.
>
> I simply copied the use of PassManager from the Kaleidoscope tutorial:
>
>  let pm = PassManager.create_function mp in
>  TargetData.add (ExecutionEngine.target_data ee) pm;
>
>  add_constant_propagation pm;
>
>  (* Do simple "peephole" optimizations and bit-twiddling optzn.
*)
>  add_instruction_combining pm;
>
>  (* reassociate expressions. *)
>  add_reassociation pm;
>
>  (* Eliminate Common SubExpressions. *)
>  add_gvn pm;
>
>  (* Simplify the control flow graph (deleting unreachable blocks,  
> etc). *)
>  add_cfg_simplification pm;
>
>  add_memory_to_register_promotion pm;
>
> and then I apply "PassManager.run_function" to every function
after
> it is
> validated.
>
> Any idea what I might be doing wrong? Has anyone else got this  
> functionality
> giving performance boosts from OCaml?
>
> -- 
> Dr Jon Harrop, Flying Frog Consultancy Ltd.
> http://www.ffconsultancy.com/?e
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Jon Harrop

2009-Feb-19 20:01 UTC

head link

[LLVMdev] Improving performance with optimization passes

On Thursday 19 February 2009 19:32:14 Gordon Henriksen
wrote:> Hi Jon,
>
> On 2009-02-19, at 14:00, Jon Harrop wrote:
> > I'm toying with benchmarks on my HLVM and am unable to get any
> > performance improvement from optimization passes. I simply copied
> > the use of PassManager from the Kaleidoscope tutorial:
> >
> > Any idea what I might be doing wrong? Has anyone else got this
> > functionality giving performance boosts from OCaml?
>
> That's a pretty barren optimization pipeline.
Right but am I correct in believing that what I have done is pretty much all 
that you can do from OCaml right now?
> http://llvm.org/docs/Passes.html
>
> See opt --help and look into what -std-compile-opts is. That's the
> usual starting point for new front ends, although it's a whole-module
> pass pipeline. But it's only a starting point, since it's tuned for
> llvm-gcc's codegen and yours will probably differ.
Thanks.
> > Moreover, some of my programs generate a lot of redundant code (e.g.
> > alloca a struct, store a struct into it and read only one field
> > without using the rest of the struct) and this does not appear to be
> > optimized away.
>
> I think first-class aggregates are mostly used for passing arguments
> in llvm-gcc and clang; maybe mem2reg can't see unravel the loads. Have
> you tried emitting loads and stores of the scalar elements to see if
> mem2reg can eliminate the allocas then?
If I could do that I wouldn't be in this mess! ;-)

I am only generating these temporary structs on the stack because I cannot 
load and store struct elements any other way because the OCaml bindings do 
not yet have insertvalue and extractvalue.
> P.S. This is not a trivial problem domain. Here's an interesting paper
> on the subject.
>
> COLE: Compiler Optimization Level Exploration
> http://users.elis.ugent.be/~leeckhou/papers/cgo08.pdf
I'll check it out, thanks.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

Anton Korobeynikov

2009-Feb-19 20:02 UTC

head link

[LLVMdev] Improving performance with optimization passes

>
> On Thursday 19 February 2009 19:00:14 Jon Harrop wrote:
>> I'm toying with benchmarks on my HLVM and am unable to get any  
>> performance
>> improvement from optimization passes...
>
> I just disassembled some of the IR before and after optimization.  
> This example
> function squares a complex number:Something is definitely wrong with the way you're using optimization  
passes.
> The ideal result is probably:It is indeed so:

  ./opt -std-compile-opts test.bc | ./llvm-dis
; ModuleID = '<stdin>'

define fastcc i32 @zsqr({ double, double }* nocapture, { double,  
double }) nounwind {
entry:
	%2 = extractvalue { double, double } %1, 0		; <double> [#uses=1]
	%3 = extractvalue { double, double } %1, 0		; <double> [#uses=1]
	%4 = mul double %2, %3		; <double> [#uses=1]
	%5 = extractvalue { double, double } %1, 1		; <double> [#uses=1]
	%6 = extractvalue { double, double } %1, 1		; <double> [#uses=1]
	%7 = mul double %5, %6		; <double> [#uses=1]
	%8 = sub double %4, %7		; <double> [#uses=1]
	%9 = extractvalue { double, double } %1, 0		; <double> [#uses=1]
	%10 = mul double %9, 2.000000e+00		; <double> [#uses=1]
	%11 = extractvalue { double, double } %1, 1		; <double> [#uses=1]
	%12 = mul double %10, %11		; <double> [#uses=1]
	%insert = insertvalue { double, double } undef, double %8, 0		;  
<{ double, double }> [#uses=1]
	%insert2 = insertvalue { double, double } %insert, double %12, 1		;  
<{ double, double }> [#uses=1]
	store { double, double } %insert2, { double, double }* %0
	ret i32 0
}

---
With best regards, Anton Korobeynikov
Faculty of Mathematics and Mechanics, Saint Petersburg State University

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20090219/1d985257/attachment.html>

Chris Lattner

2009-Feb-19 22:31 UTC

head link

[LLVMdev] Improving performance with optimization passes

On Feb 19, 2009, at 11:44 AM, Jon Harrop wrote:
> On Thursday 19 February 2009 19:00:14 Jon Harrop wrote:
>> I'm toying with benchmarks on my HLVM and am unable to get any  
>> performance
>> improvement from optimization passes...
>
> I just disassembled some of the IR before and after optimization.  
> This example
> function squares a complex number:
>
>  let zsqr(r, i) = (r*r - i*i, 2*r*i)
>
> My compiler is generating:
>
> define fastcc i32 @zsqr({ double, double }*, { double, double }) {
> entry:
> 	%2 = alloca { double, double }		; <{ double, double }*> [#uses=2]
Jon, make sure you always emit allocas into the entry block of the  
function.  Otherwise mem2reg and SROA won't hack on them.

-Chris

Maybe Matching Threads

Search for more seemingly similar threads

llvm dev - Feb 2009 - [LLVMdev] Improving performance with optimization passes

[LLVMdev] Improving performance with optimization passes

[LLVMdev] Improving performance with optimization passes

[LLVMdev] Improving performance with optimization passes

[LLVMdev] Improving performance with optimization passes

[LLVMdev] Improving performance with optimization passes

[LLVMdev] Improving performance with optimization passes

[LLVMdev] Improving performance with optimization passes

Maybe Matching Threads