thr3ads.net - llvm dev - [LLVMdev] Small problem with the tail call elimination pass [May 2014]

If this information is useful, please help other people find it:
Share via:

Nicola Gigante

2014-May-08 15:57 UTC

[LLVMdev] Small problem with the tail call elimination pass

Hello everybody,

On the documentation page for the tailcallelim pass you can read:

"This pass transforms functions that are prevented from being tail
recursive by an
associative expression to use an accumulator variable, thus compiling the
typical
naive factorial or fib implementation into efficient code”

However, I don’t see this behavior when trying to compile this variant of
the mentioned “typical naive fib implementation”:

unsigned int fib(unsigned int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}

The IR with clang -O3 (version 3.4) is this:
define i32 @_Z9fibj(i32 %n) #0 {
  %1 = icmp ult i32 %n, 2
  br i1 %1, label %8, label %2

; <label>:2                                       ; preds = %0
  %3 = add i32 %n, -1
  %4 = tail call i32 @_Z9fibj(i32 %3)
  %5 = add i32 %n, -2
  %6 = tail call i32 @_Z9fibj(i32 %5)
  %7 = add i32 %6, %4
  ret i32 %7

; <label>:8                                       ; preds = %0
  ret i32 %n
}

You see that the tail call is not eliminated. However, it does get eliminated
if I change to this code:

unsigned int fib(unsigned int n) {
    return n <= 2 ? 1 : fib(n-1) + fib(n-2);
}

IR:
define i32 @_Z9fibj(i32 %n) #0 {
  %1 = icmp ult i32 %n, 3
  br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse:                                      ; preds = %0, %tailrecurse
  %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ]
  %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ]
  %2 = add i32 %n.tr2, -1
  %3 = tail call i32 @_Z9fibj(i32 %2)
  %4 = add i32 %n.tr2, -2
  %5 = add i32 %3, %accumulator.tr1
  %6 = icmp ult i32 %4, 3
  br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse._crit_edge:                           ; preds = %tailrecurse, %0
  %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ]
  ret i32 %accumulator.tr.lcssa
}

Now, note that gcc 4.8 compiles well both the examples, and the
resulting loop gets also unrolled a dozen times.

I’m compiling with -O3 with both compilers.

So what is stopping the tail call elimination pass to doing the same
thing?

Note that I don’t write this email because I need a performing
fibonacci function… What I’m trying to do is to track down the 
reason why fortran code is performing better than C on the
source code used as benchmark in the main page of the 
Julia language in [1]. I wasn’t expecting this huge difference
between fortran and C code on an example as simple as this, so I
tried to investigate, even if I don’t know anything about fortran.
Their source code use int instead of unsigned as in my example.
I don’t know quite why, but this slows down everything both in gcc
and clang, because of something related to how the two different
languages handle overflow in integer arithmetic, I think. When I use
unsigned int, gcc performance matches gfortran’s, and the generated
code is nearly identical, while clang stays way behind because of the
missing tail call elimination.

Note that even with the modified example where the tail call does
get eliminated, a simple benchmark shows that clang’s code
is still 2-3x times slower than gcc's, I suppose because of the missing
unrolling.

Note that in that benchmark, Julia seems winning too, and they use
LLVM as the backend so something’s missing...

Can you help me understand what’s happening?

Thank you,
Nicola

[1] http://julialang.org

Nick Lewycky

2014-May-08 23:00 UTC

head link

[LLVMdev] Small problem with the tail call elimination pass

On 8 May 2014 08:57, Nicola Gigante <nicola.gigante at gmail.com> wrote:
> Hello everybody,
>
> On the documentation page for the tailcallelim pass you can read:
>
> "This pass transforms functions that are prevented from being tail
> recursive by an
> associative expression to use an accumulator variable, thus compiling the
> typical
> naive factorial or fib implementation into efficient code”
>
> However, I don’t see this behavior when trying to compile this variant of
> the mentioned “typical naive fib implementation”:
>
> unsigned int fib(unsigned int n) {
>     return n < 2 ? n : fib(n-1) + fib(n-2);
> }
>
> The IR with clang -O3 (version 3.4) is this:
> define i32 @_Z9fibj(i32 %n) #0 {
>   %1 = icmp ult i32 %n, 2
>   br i1 %1, label %8, label %2
>
> ; <label>:2                                       ; preds = %0
>   %3 = add i32 %n, -1
>   %4 = tail call i32 @_Z9fibj(i32 %3)
>   %5 = add i32 %n, -2
>   %6 = tail call i32 @_Z9fibj(i32 %5)
>   %7 = add i32 %6, %4
>   ret i32 %7
>
> ; <label>:8                                       ; preds = %0
>   ret i32 %n
> }
>
> You see that the tail call is not eliminated. However, it does get
> eliminated
> if I change to this code:
>
> unsigned int fib(unsigned int n) {
>     return n <= 2 ? 1 : fib(n-1) + fib(n-2);
> }
>
> IR:
> define i32 @_Z9fibj(i32 %n) #0 {
>   %1 = icmp ult i32 %n, 3
>   br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse
>
> tailrecurse:                                      ; preds = %0,
> %tailrecurse
>   %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ]
>   %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ]
>   %2 = add i32 %n.tr2, -1
>   %3 = tail call i32 @_Z9fibj(i32 %2)
>   %4 = add i32 %n.tr2, -2
>   %5 = add i32 %3, %accumulator.tr1
>   %6 = icmp ult i32 %4, 3
>   br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse
>
> tailrecurse._crit_edge:                           ; preds = %tailrecurse,
> %0
>   %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ]
>   ret i32 %accumulator.tr.lcssa
> }
>
> Now, note that gcc 4.8 compiles well both the examples, and the
> resulting loop gets also unrolled a dozen times.
>
> I’m compiling with -O3 with both compilers.
>
> So what is stopping the tail call elimination pass to doing the same
> thing?
>
At the moment TRE runs, the function reads:

; Function Attrs: nounwind readnone
define i32 @fib(i32 %n) #0 {
entry:
  %cmp = icmp ult i32 %n, 2
  br i1 %cmp, label %cond.end, label %cond.false

cond.false:                                       ; preds = %entry
  %sub = add i32 %n, -1
  %call = tail call i32 @fib(i32 %sub)
  %sub1 = add i32 %n, -2
  %call2 = tail call i32 @fib(i32 %sub1)
  %add = add i32 %call, %call2
  br label %cond.end

cond.end:                                         ; preds = %entry,
%cond.false
  %cond = phi i32 [ %add, %cond.false ], [ %n, %entry ]
  ret i32 %cond
}

TRE starts at the 'ret' and looks upwards and checks all of that
block's
predecessors to find any that end in unconditional branches. Then it walks
upwards in such blocks to find if there's a "TRE candidate" which
is a call
to itself (but aborts if there are instructions is can't handle between the
call and the end of the block), and manages to find %call2 -- and calls
EliminateRecursiveTailCall on it. Good job! At this stage, it transforms
the "br label %cond.end" into a "ret i32 %add" and updates
the phi in
%cond.end. There are now *two* returns, the one in %cond.end is "ret i32
%n".

EliminateRecursiveTailCall detects that TRE'ing this requires performing
accumulator recursion elimination. To do this, it checks that "the function
containing the specified tail call consistently returns the same
runtime-constant value" (at all exit points except for our ret at the end
of %cond.false). Now that we have a ret i32 %n, this is false; the safety
check finds that the argument for %n can change because we pass in %sub and
%sub1 into ourselves instead of always passing %n to ourselves. That's what
stops the transformation.

Your working example makes two modifications, you change "n < 2" to
"n <2" and you change "? n" to "? 1". It's
the second of these that will make
the difference, because it changes %cond to:

  %cond = phi i32 [ %add, %cond.false ], [ 1, %entry ]

Everything proceeds the same way as above, except that when the branch in
%cond.false is replaced with a return, the return in %cond.end folds to
"ret i32 1", a constant. Being the same value each time, this is
simple
enough for accumulator recursion elimination.

Nick

PS. In case you aren't aware, I recently bemoaned the state of our TRE
abilities here:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140505/216092.html

Note that I don’t write this email because I need a
performing> fibonacci function… What I’m trying to do is to track down the
> reason why fortran code is performing better than C on the
> source code used as benchmark in the main page of the
> Julia language in [1]. I wasn’t expecting this huge difference
> between fortran and C code on an example as simple as this, so I
> tried to investigate, even if I don’t know anything about fortran.
> Their source code use int instead of unsigned as in my example.
> I don’t know quite why, but this slows down everything both in gcc
> and clang, because of something related to how the two different
> languages handle overflow in integer arithmetic, I think. When I use
> unsigned int, gcc performance matches gfortran’s, and the generated
> code is nearly identical, while clang stays way behind because of the
> missing tail call elimination.
>
> Note that even with the modified example where the tail call does
> get eliminated, a simple benchmark shows that clang’s code
> is still 2-3x times slower than gcc's, I suppose because of the missing
> unrolling.
>
> Note that in that benchmark, Julia seems winning too, and they use
> LLVM as the backend so something’s missing...
>
> Can you help me understand what’s happening?
>
> Thank you,
> Nicola
>
> [1] http://julialang.org
>
>
>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140508/990f8606/attachment.html>

Reid Kleckner

2014-May-08 23:22 UTC

head link

[LLVMdev] Small problem with the tail call elimination pass

I don't know who wrote that description of TRE, but your fib implementation
is obviously not tail recursive, so I don't know what we could do with it.
 Maybe the author had this version in mind:

int fib_help(int a, int b, int n) {
  if (n > 0)
    return fib_help(b, a+b, n-1);
  return a;
}
int fib(int n) {
  return fib_help(0, 1, n);
}

TRE can turn the helper into a loop here.


On Thu, May 8, 2014 at 8:57 AM, Nicola Gigante <nicola.gigante at
gmail.com>wrote:
> Hello everybody,
>
> On the documentation page for the tailcallelim pass you can read:
>
> "This pass transforms functions that are prevented from being tail
> recursive by an
> associative expression to use an accumulator variable, thus compiling the
> typical
> naive factorial or fib implementation into efficient code”
>
> However, I don’t see this behavior when trying to compile this variant of
> the mentioned “typical naive fib implementation”:
>
> unsigned int fib(unsigned int n) {
>     return n < 2 ? n : fib(n-1) + fib(n-2);
> }
>
> The IR with clang -O3 (version 3.4) is this:
> define i32 @_Z9fibj(i32 %n) #0 {
>   %1 = icmp ult i32 %n, 2
>   br i1 %1, label %8, label %2
>
> ; <label>:2                                       ; preds = %0
>   %3 = add i32 %n, -1
>   %4 = tail call i32 @_Z9fibj(i32 %3)
>   %5 = add i32 %n, -2
>   %6 = tail call i32 @_Z9fibj(i32 %5)
>   %7 = add i32 %6, %4
>   ret i32 %7
>
> ; <label>:8                                       ; preds = %0
>   ret i32 %n
> }
>
> You see that the tail call is not eliminated. However, it does get
> eliminated
> if I change to this code:
>
> unsigned int fib(unsigned int n) {
>     return n <= 2 ? 1 : fib(n-1) + fib(n-2);
> }
>
> IR:
> define i32 @_Z9fibj(i32 %n) #0 {
>   %1 = icmp ult i32 %n, 3
>   br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse
>
> tailrecurse:                                      ; preds = %0,
> %tailrecurse
>   %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ]
>   %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ]
>   %2 = add i32 %n.tr2, -1
>   %3 = tail call i32 @_Z9fibj(i32 %2)
>   %4 = add i32 %n.tr2, -2
>   %5 = add i32 %3, %accumulator.tr1
>   %6 = icmp ult i32 %4, 3
>   br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse
>
> tailrecurse._crit_edge:                           ; preds = %tailrecurse,
> %0
>   %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ]
>   ret i32 %accumulator.tr.lcssa
> }
>
> Now, note that gcc 4.8 compiles well both the examples, and the
> resulting loop gets also unrolled a dozen times.
>
> I’m compiling with -O3 with both compilers.
>
> So what is stopping the tail call elimination pass to doing the same
> thing?
>
> Note that I don’t write this email because I need a performing
> fibonacci function… What I’m trying to do is to track down the
> reason why fortran code is performing better than C on the
> source code used as benchmark in the main page of the
> Julia language in [1]. I wasn’t expecting this huge difference
> between fortran and C code on an example as simple as this, so I
> tried to investigate, even if I don’t know anything about fortran.
> Their source code use int instead of unsigned as in my example.
> I don’t know quite why, but this slows down everything both in gcc
> and clang, because of something related to how the two different
> languages handle overflow in integer arithmetic, I think. When I use
> unsigned int, gcc performance matches gfortran’s, and the generated
> code is nearly identical, while clang stays way behind because of the
> missing tail call elimination.
>
> Note that even with the modified example where the tail call does
> get eliminated, a simple benchmark shows that clang’s code
> is still 2-3x times slower than gcc's, I suppose because of the missing
> unrolling.
>
> Note that in that benchmark, Julia seems winning too, and they use
> LLVM as the backend so something’s missing...
>
> Can you help me understand what’s happening?
>
> Thank you,
> Nicola
>
> [1] http://julialang.org
>
>
>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140508/0c1b1f06/attachment.html>

Nicola Gigante

2014-May-09 08:21 UTC

head link

[LLVMdev] Small problem with the tail call elimination pass

Il giorno 09/mag/2014, alle ore 01:00, Nick Lewycky <nlewycky at
google.com> ha scritto:
> At the moment TRE runs, the function reads:
> 
> ; Function Attrs: nounwind readnone
> define i32 @fib(i32 %n) #0 {
> entry:
>   %cmp = icmp ult i32 %n, 2
>   br i1 %cmp, label %cond.end, label %cond.false
> 
> cond.false:                                       ; preds = %entry
>   %sub = add i32 %n, -1
>   %call = tail call i32 @fib(i32 %sub)
>   %sub1 = add i32 %n, -2
>   %call2 = tail call i32 @fib(i32 %sub1)
>   %add = add i32 %call, %call2
>   br label %cond.end
> 
> cond.end:                                         ; preds = %entry,
%cond.false
>   %cond = phi i32 [ %add, %cond.false ], [ %n, %entry ]
>   ret i32 %cond
> }
> 
> TRE starts at the 'ret' and looks upwards and checks all of that
block's predecessors to find any that end in unconditional branches. Then it
walks upwards in such blocks to find if there's a "TRE candidate"
which is a call to itself (but aborts if there are instructions is can't
handle between the call and the end of the block), and manages to find %call2 --
and calls EliminateRecursiveTailCall on it. Good job! At this stage, it
transforms the "br label %cond.end" into a "ret i32 %add"
and updates the phi in %cond.end. There are now *two* returns, the one in
%cond.end is "ret i32 %n".
> 
> EliminateRecursiveTailCall detects that TRE'ing this requires
performing accumulator recursion elimination. To do this, it checks that
"the function containing the specified tail call consistently returns the
same runtime-constant value" (at all exit points except for our ret at the
end of %cond.false). Now that we have a ret i32 %n, this is false; the safety
check finds that the argument for %n can change because we pass in %sub and
%sub1 into ourselves instead of always passing %n to ourselves. That's what
stops the transformation.
> 
> Your working example makes two modifications, you change "n <
2" to "n <= 2" and you change "? n" to "?
1". It's the second of these that will make the difference, because it
changes %cond to:
> 
>   %cond = phi i32 [ %add, %cond.false ], [ 1, %entry ]
> 
> Everything proceeds the same way as above, except that when the branch in
%cond.false is replaced with a return, the return in %cond.end folds to
"ret i32 1", a constant. Being the same value each time, this is
simple enough for accumulator recursion elimination.
> 
Thank you very much for the explanation! It all makes sense now.
However, the point remains that gcc (compiling down from both C or fortran) can
indeed optimize this code despite the non-constant
base case. Do you or someone know what reasoning does it apply?
Should I file a bug report?
> 
> PS. In case you aren't aware, I recently bemoaned the state of our TRE
abilities here:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140505/216092.html

Very interesting!
> Nick
Thanks,
Nicola

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140509/e56ec15d/attachment.html>

llvm dev - May 2014 - [LLVMdev] Small problem with the tail call elimination pass

[LLVMdev] Small problem with the tail call elimination pass

[LLVMdev] Small problem with the tail call elimination pass

[LLVMdev] Small problem with the tail call elimination pass

[LLVMdev] Small problem with the tail call elimination pass