thr3ads.net - llvm dev - [LLVMdev] Hoisting elements of array argument into registers [Nov 2010]

If this information is useful, please help other people find it:
Share via:

David Peixotto

2010-Nov-06 22:00 UTC

[LLVMdev] Hoisting elements of array argument into registers

I am seeing the wf loop get optimized just fine with llvm 2.8 (and almost as
good with head). I'm running on Mac OS X 10.6. I have an apple supplied
llvm-gcc and a self compiled llvm 2.8. When I run

    $ llvm-gcc -emit-llvm -S M.c
    $ opt -O2 M.s | llvm-dis

I see that:

1. Tail recursion has been eliminated from wf 
2. The accesses to sp have been promoted to registers 
3. The loop has been translated to straight line code that computes the result
directly based off of the induction variables.

I am surprised that others are not seeing the same thing. If I download the
llvm-gcc binary from http://llvm.org/releases/download.html#2.8 and run

    $ llvm-gcc -O2 -emit-llvm -S

I see the same results as running `opt -O2` by hand. However, if I use the apple
supplied llvm-gcc and run `llvm-gcc -O2`, I see that the loop does not get
optimized. I am assuming this is because the apple supplied llvm-gcc links with
an older version of the llvm optimizations.

If I optimize with LLVM head I see basically the same results, except that it
leaves in one redundant load. I'm curious why others are seeing such
different results.

I don't see anything in wf that should prevent the sp accesses from being
promoted to registers. I would think they should be promoted to registers, but
we will still have to write the results back to sp before returning from wf
because in general we can't prove that sp is not accessed after returning
from wf.

-David

On Nov 5, 2010, at 5:16 PM, Max Bolingbroke wrote:
> Duncan Sands <baldrick <at> free.fr> writes:
>> 
>>> I see the same with clang.  I'm not sure why the optimizers do
so much better
>>> when they can see that sp is a local array (the special initial
values don't
>>> matter).
>> 
>> It is the scalar replacement of aggregates pass that puts everything
into
>> registers when sp is a local array. 
> 
> Yes, I had hoped that scalar replacement would get the array case. What
> surprised me is that it didn't. However, I can reproduce your result
(good
> optimisation for g()) with LLVM HEAD. My earlier tests used "opt
-O2", but
> once I tried again with (HEAD) Clang -O3 I got an optimised g().
> 
> However, scalar replacement can't help with functions like a
(non-inlined) wf()
> where the structure of sp is unknown. Is there any hope for LLVM optimising
> such functions? Some combination of passes that will do what I want? This
> problem is essentially killing all opportunities for loop optimisation in
> Haskell right now, so we would dearly like to have a solution :-)
> 
> Cheers,
> Max
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

Max Bolingbroke

2010-Nov-07 13:52 UTC

head link

[LLVMdev] Hoisting elements of array argument into registers

David Peixotto <dmp <at> rice.edu> writes:> I am seeing the wf loop get optimized just fine with llvm 2.8 (and almostas good with head).

I rechecked this and am I actually seeing the same results as you. I think I
must have made a stupid mistake in my tests before - sorry for the noise.

However, I found that we have a phase ordering problem which is preventing us
getting as much optimisation from LLVM as we might expect. For example, I get
this code for g(int):

"""
define i32 @g(i32 %a) nounwind readnone ssp {
entry:
  %sp = alloca [4 x i32], align 4
  %0 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 0
  store i32 %a, i32* %0, align 4
  %1 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 1
  store i32 %a, i32* %1, align 4
  %2 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 2
  store i32 0, i32* %2, align 4
  %3 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 3
  store i32 0, i32* %3, align 4
  %4 = icmp eq i32 %a, 0
  br i1 %4, label %wf.exit, label %bb.nph.i

bb.nph.i:                                         ; preds = %entry
  %.promoted1.i = load i32* %1, align 4
  %tmp12.i = add i32 %a, -1
  %tmp13.i = zext i32 %tmp12.i to i33
  %tmp14.i = add i32 %a, -2
  %tmp15.i = zext i32 %tmp14.i to i33
  %tmp16.i = mul i33 %tmp13.i, %tmp15.i
  %tmp17.i = lshr i33 %tmp16.i, 1
  %tmp18.i = trunc i33 %tmp17.i to i32
  %tmp20.i = mul i32 %.promoted1.i, 5
  %tmp21.i = add i32 %tmp20.i, -5
  %tmp22.i = mul i32 %tmp21.i, %tmp12.i
  %tmp9.i = mul i32 %a, %a
  %.promoted2.i = load i32* %2, align 4
  %tmp25.i = mul i32 %tmp18.i, 5
  %tmp.i = sub i32 %.promoted1.i, %a
  %tmp10.i = add i32 %tmp9.i, 1
  %tmp11.i = sub i32 %tmp10.i, %tmp18.i
  %tmp19.i = add i32 %tmp11.i, %.promoted2.i
  %tmp23.i = sub i32 %tmp20.i, %tmp25.i
  %tmp26.i = add i32 %tmp23.i, %tmp22.i
  store i32 0, i32* %0, align 4
  store i32 %tmp19.i, i32* %2, align 4
  store i32 %tmp.i, i32* %1, align 4
  store i32 %tmp26.i, i32* %3, align 4
  br label %wf.exit

wf.exit:                                          ; preds = %entry, %bb.nph.i
  %5 = phi i32 [ %tmp26.i, %bb.nph.i ], [ 0, %entry ]
  %6 = load i32* %2, align 4
  %7 = add nsw i32 %6, %5
  ret i32 %7
}
"""

As you can see, we have "load i32* %1, align 4" which is dominated by
"store i32 %a, i32* %1, align 4",  and similarly for the load/store to
%2.
What is more, it is clear than no intervening store aliases with that 
store because all the other stores are to different pointers - and indeed
LLVM can even see that the memory to which they store is allocaed.

In fact there are at least 3 missed opportunities:
* No forwarding of the stores to loads: if we did this, we could spot that
%tmp.i is just 0, and %tmp19.iis just the same as %tmp11.i
* The stores to %2 could be forwarded to the final "load i32* %2, align
4"
by introducing a new phi node for the wf.exit block
* All of the stores to %sp+N can then be eliminated since the memory is
allocaed, and there will be no loads from any of that memory before the
function returns

However, it turns out that if you just run "opt -O3" twice then all of
this
stuff gets eliminated and you get truly optimal code - pretty awesome!

Cheers,
Max

Duncan Sands

2010-Nov-12 14:41 UTC

head link

[LLVMdev] Hoisting elements of array argument into registers

> However, it turns out that if you just run "opt -O3" twice then
all of this
> stuff gets eliminated and you get truly optimal code - pretty awesome!
Yes, if I pipe the output from "dragonegg -O3" into "opt
-O3" then the loop
is eliminated in "wf".  It would be interesting to find out why, and
improve
the current list of passes.

Ciao,

Duncan.

Apparently Analagous Threads

Search for more seemingly similar threads

llvm dev - Nov 2010 - [LLVMdev] Hoisting elements of array argument into registers

[LLVMdev] Hoisting elements of array argument into registers

[LLVMdev] Hoisting elements of array argument into registers

[LLVMdev] Hoisting elements of array argument into registers

Apparently Analagous Threads