Nicola Gigante
2014-May-08 15:57 UTC
[LLVMdev] Small problem with the tail call elimination pass
Hello everybody, On the documentation page for the tailcallelim pass you can read: "This pass transforms functions that are prevented from being tail recursive by an associative expression to use an accumulator variable, thus compiling the typical naive factorial or fib implementation into efficient code” However, I don’t see this behavior when trying to compile this variant of the mentioned “typical naive fib implementation”: unsigned int fib(unsigned int n) { return n < 2 ? n : fib(n-1) + fib(n-2); } The IR with clang -O3 (version 3.4) is this: define i32 @_Z9fibj(i32 %n) #0 { %1 = icmp ult i32 %n, 2 br i1 %1, label %8, label %2 ; <label>:2 ; preds = %0 %3 = add i32 %n, -1 %4 = tail call i32 @_Z9fibj(i32 %3) %5 = add i32 %n, -2 %6 = tail call i32 @_Z9fibj(i32 %5) %7 = add i32 %6, %4 ret i32 %7 ; <label>:8 ; preds = %0 ret i32 %n } You see that the tail call is not eliminated. However, it does get eliminated if I change to this code: unsigned int fib(unsigned int n) { return n <= 2 ? 1 : fib(n-1) + fib(n-2); } IR: define i32 @_Z9fibj(i32 %n) #0 { %1 = icmp ult i32 %n, 3 br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse tailrecurse: ; preds = %0, %tailrecurse %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ] %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ] %2 = add i32 %n.tr2, -1 %3 = tail call i32 @_Z9fibj(i32 %2) %4 = add i32 %n.tr2, -2 %5 = add i32 %3, %accumulator.tr1 %6 = icmp ult i32 %4, 3 br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse tailrecurse._crit_edge: ; preds = %tailrecurse, %0 %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ] ret i32 %accumulator.tr.lcssa } Now, note that gcc 4.8 compiles well both the examples, and the resulting loop gets also unrolled a dozen times. I’m compiling with -O3 with both compilers. So what is stopping the tail call elimination pass to doing the same thing? Note that I don’t write this email because I need a performing fibonacci function… What I’m trying to do is to track down the reason why fortran code is performing better than C on the source code used as benchmark in the main page of the Julia language in [1]. I wasn’t expecting this huge difference between fortran and C code on an example as simple as this, so I tried to investigate, even if I don’t know anything about fortran. Their source code use int instead of unsigned as in my example. I don’t know quite why, but this slows down everything both in gcc and clang, because of something related to how the two different languages handle overflow in integer arithmetic, I think. When I use unsigned int, gcc performance matches gfortran’s, and the generated code is nearly identical, while clang stays way behind because of the missing tail call elimination. Note that even with the modified example where the tail call does get eliminated, a simple benchmark shows that clang’s code is still 2-3x times slower than gcc's, I suppose because of the missing unrolling. Note that in that benchmark, Julia seems winning too, and they use LLVM as the backend so something’s missing... Can you help me understand what’s happening? Thank you, Nicola [1] http://julialang.org
Nick Lewycky
2014-May-08 23:00 UTC
[LLVMdev] Small problem with the tail call elimination pass
On 8 May 2014 08:57, Nicola Gigante <nicola.gigante at gmail.com> wrote:> Hello everybody, > > On the documentation page for the tailcallelim pass you can read: > > "This pass transforms functions that are prevented from being tail > recursive by an > associative expression to use an accumulator variable, thus compiling the > typical > naive factorial or fib implementation into efficient code” > > However, I don’t see this behavior when trying to compile this variant of > the mentioned “typical naive fib implementation”: > > unsigned int fib(unsigned int n) { > return n < 2 ? n : fib(n-1) + fib(n-2); > } > > The IR with clang -O3 (version 3.4) is this: > define i32 @_Z9fibj(i32 %n) #0 { > %1 = icmp ult i32 %n, 2 > br i1 %1, label %8, label %2 > > ; <label>:2 ; preds = %0 > %3 = add i32 %n, -1 > %4 = tail call i32 @_Z9fibj(i32 %3) > %5 = add i32 %n, -2 > %6 = tail call i32 @_Z9fibj(i32 %5) > %7 = add i32 %6, %4 > ret i32 %7 > > ; <label>:8 ; preds = %0 > ret i32 %n > } > > You see that the tail call is not eliminated. However, it does get > eliminated > if I change to this code: > > unsigned int fib(unsigned int n) { > return n <= 2 ? 1 : fib(n-1) + fib(n-2); > } > > IR: > define i32 @_Z9fibj(i32 %n) #0 { > %1 = icmp ult i32 %n, 3 > br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse > > tailrecurse: ; preds = %0, > %tailrecurse > %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ] > %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ] > %2 = add i32 %n.tr2, -1 > %3 = tail call i32 @_Z9fibj(i32 %2) > %4 = add i32 %n.tr2, -2 > %5 = add i32 %3, %accumulator.tr1 > %6 = icmp ult i32 %4, 3 > br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse > > tailrecurse._crit_edge: ; preds = %tailrecurse, > %0 > %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ] > ret i32 %accumulator.tr.lcssa > } > > Now, note that gcc 4.8 compiles well both the examples, and the > resulting loop gets also unrolled a dozen times. > > I’m compiling with -O3 with both compilers. > > So what is stopping the tail call elimination pass to doing the same > thing? >At the moment TRE runs, the function reads: ; Function Attrs: nounwind readnone define i32 @fib(i32 %n) #0 { entry: %cmp = icmp ult i32 %n, 2 br i1 %cmp, label %cond.end, label %cond.false cond.false: ; preds = %entry %sub = add i32 %n, -1 %call = tail call i32 @fib(i32 %sub) %sub1 = add i32 %n, -2 %call2 = tail call i32 @fib(i32 %sub1) %add = add i32 %call, %call2 br label %cond.end cond.end: ; preds = %entry, %cond.false %cond = phi i32 [ %add, %cond.false ], [ %n, %entry ] ret i32 %cond } TRE starts at the 'ret' and looks upwards and checks all of that block's predecessors to find any that end in unconditional branches. Then it walks upwards in such blocks to find if there's a "TRE candidate" which is a call to itself (but aborts if there are instructions is can't handle between the call and the end of the block), and manages to find %call2 -- and calls EliminateRecursiveTailCall on it. Good job! At this stage, it transforms the "br label %cond.end" into a "ret i32 %add" and updates the phi in %cond.end. There are now *two* returns, the one in %cond.end is "ret i32 %n". EliminateRecursiveTailCall detects that TRE'ing this requires performing accumulator recursion elimination. To do this, it checks that "the function containing the specified tail call consistently returns the same runtime-constant value" (at all exit points except for our ret at the end of %cond.false). Now that we have a ret i32 %n, this is false; the safety check finds that the argument for %n can change because we pass in %sub and %sub1 into ourselves instead of always passing %n to ourselves. That's what stops the transformation. Your working example makes two modifications, you change "n < 2" to "n <2" and you change "? n" to "? 1". It's the second of these that will make the difference, because it changes %cond to: %cond = phi i32 [ %add, %cond.false ], [ 1, %entry ] Everything proceeds the same way as above, except that when the branch in %cond.false is replaced with a return, the return in %cond.end folds to "ret i32 1", a constant. Being the same value each time, this is simple enough for accumulator recursion elimination. Nick PS. In case you aren't aware, I recently bemoaned the state of our TRE abilities here: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140505/216092.html Note that I don’t write this email because I need a performing> fibonacci function… What I’m trying to do is to track down the > reason why fortran code is performing better than C on the > source code used as benchmark in the main page of the > Julia language in [1]. I wasn’t expecting this huge difference > between fortran and C code on an example as simple as this, so I > tried to investigate, even if I don’t know anything about fortran. > Their source code use int instead of unsigned as in my example. > I don’t know quite why, but this slows down everything both in gcc > and clang, because of something related to how the two different > languages handle overflow in integer arithmetic, I think. When I use > unsigned int, gcc performance matches gfortran’s, and the generated > code is nearly identical, while clang stays way behind because of the > missing tail call elimination. > > Note that even with the modified example where the tail call does > get eliminated, a simple benchmark shows that clang’s code > is still 2-3x times slower than gcc's, I suppose because of the missing > unrolling. > > Note that in that benchmark, Julia seems winning too, and they use > LLVM as the backend so something’s missing... > > Can you help me understand what’s happening? > > Thank you, > Nicola > > [1] http://julialang.org > > > > > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140508/990f8606/attachment.html>
Reid Kleckner
2014-May-08 23:22 UTC
[LLVMdev] Small problem with the tail call elimination pass
I don't know who wrote that description of TRE, but your fib implementation is obviously not tail recursive, so I don't know what we could do with it. Maybe the author had this version in mind: int fib_help(int a, int b, int n) { if (n > 0) return fib_help(b, a+b, n-1); return a; } int fib(int n) { return fib_help(0, 1, n); } TRE can turn the helper into a loop here. On Thu, May 8, 2014 at 8:57 AM, Nicola Gigante <nicola.gigante at gmail.com>wrote:> Hello everybody, > > On the documentation page for the tailcallelim pass you can read: > > "This pass transforms functions that are prevented from being tail > recursive by an > associative expression to use an accumulator variable, thus compiling the > typical > naive factorial or fib implementation into efficient code” > > However, I don’t see this behavior when trying to compile this variant of > the mentioned “typical naive fib implementation”: > > unsigned int fib(unsigned int n) { > return n < 2 ? n : fib(n-1) + fib(n-2); > } > > The IR with clang -O3 (version 3.4) is this: > define i32 @_Z9fibj(i32 %n) #0 { > %1 = icmp ult i32 %n, 2 > br i1 %1, label %8, label %2 > > ; <label>:2 ; preds = %0 > %3 = add i32 %n, -1 > %4 = tail call i32 @_Z9fibj(i32 %3) > %5 = add i32 %n, -2 > %6 = tail call i32 @_Z9fibj(i32 %5) > %7 = add i32 %6, %4 > ret i32 %7 > > ; <label>:8 ; preds = %0 > ret i32 %n > } > > You see that the tail call is not eliminated. However, it does get > eliminated > if I change to this code: > > unsigned int fib(unsigned int n) { > return n <= 2 ? 1 : fib(n-1) + fib(n-2); > } > > IR: > define i32 @_Z9fibj(i32 %n) #0 { > %1 = icmp ult i32 %n, 3 > br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse > > tailrecurse: ; preds = %0, > %tailrecurse > %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ] > %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ] > %2 = add i32 %n.tr2, -1 > %3 = tail call i32 @_Z9fibj(i32 %2) > %4 = add i32 %n.tr2, -2 > %5 = add i32 %3, %accumulator.tr1 > %6 = icmp ult i32 %4, 3 > br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse > > tailrecurse._crit_edge: ; preds = %tailrecurse, > %0 > %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ] > ret i32 %accumulator.tr.lcssa > } > > Now, note that gcc 4.8 compiles well both the examples, and the > resulting loop gets also unrolled a dozen times. > > I’m compiling with -O3 with both compilers. > > So what is stopping the tail call elimination pass to doing the same > thing? > > Note that I don’t write this email because I need a performing > fibonacci function… What I’m trying to do is to track down the > reason why fortran code is performing better than C on the > source code used as benchmark in the main page of the > Julia language in [1]. I wasn’t expecting this huge difference > between fortran and C code on an example as simple as this, so I > tried to investigate, even if I don’t know anything about fortran. > Their source code use int instead of unsigned as in my example. > I don’t know quite why, but this slows down everything both in gcc > and clang, because of something related to how the two different > languages handle overflow in integer arithmetic, I think. When I use > unsigned int, gcc performance matches gfortran’s, and the generated > code is nearly identical, while clang stays way behind because of the > missing tail call elimination. > > Note that even with the modified example where the tail call does > get eliminated, a simple benchmark shows that clang’s code > is still 2-3x times slower than gcc's, I suppose because of the missing > unrolling. > > Note that in that benchmark, Julia seems winning too, and they use > LLVM as the backend so something’s missing... > > Can you help me understand what’s happening? > > Thank you, > Nicola > > [1] http://julialang.org > > > > > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140508/0c1b1f06/attachment.html>
Nicola Gigante
2014-May-09 08:21 UTC
[LLVMdev] Small problem with the tail call elimination pass
Il giorno 09/mag/2014, alle ore 01:00, Nick Lewycky <nlewycky at google.com> ha scritto:> At the moment TRE runs, the function reads: > > ; Function Attrs: nounwind readnone > define i32 @fib(i32 %n) #0 { > entry: > %cmp = icmp ult i32 %n, 2 > br i1 %cmp, label %cond.end, label %cond.false > > cond.false: ; preds = %entry > %sub = add i32 %n, -1 > %call = tail call i32 @fib(i32 %sub) > %sub1 = add i32 %n, -2 > %call2 = tail call i32 @fib(i32 %sub1) > %add = add i32 %call, %call2 > br label %cond.end > > cond.end: ; preds = %entry, %cond.false > %cond = phi i32 [ %add, %cond.false ], [ %n, %entry ] > ret i32 %cond > } > > TRE starts at the 'ret' and looks upwards and checks all of that block's predecessors to find any that end in unconditional branches. Then it walks upwards in such blocks to find if there's a "TRE candidate" which is a call to itself (but aborts if there are instructions is can't handle between the call and the end of the block), and manages to find %call2 -- and calls EliminateRecursiveTailCall on it. Good job! At this stage, it transforms the "br label %cond.end" into a "ret i32 %add" and updates the phi in %cond.end. There are now *two* returns, the one in %cond.end is "ret i32 %n". > > EliminateRecursiveTailCall detects that TRE'ing this requires performing accumulator recursion elimination. To do this, it checks that "the function containing the specified tail call consistently returns the same runtime-constant value" (at all exit points except for our ret at the end of %cond.false). Now that we have a ret i32 %n, this is false; the safety check finds that the argument for %n can change because we pass in %sub and %sub1 into ourselves instead of always passing %n to ourselves. That's what stops the transformation. > > Your working example makes two modifications, you change "n < 2" to "n <= 2" and you change "? n" to "? 1". It's the second of these that will make the difference, because it changes %cond to: > > %cond = phi i32 [ %add, %cond.false ], [ 1, %entry ] > > Everything proceeds the same way as above, except that when the branch in %cond.false is replaced with a return, the return in %cond.end folds to "ret i32 1", a constant. Being the same value each time, this is simple enough for accumulator recursion elimination. >Thank you very much for the explanation! It all makes sense now. However, the point remains that gcc (compiling down from both C or fortran) can indeed optimize this code despite the non-constant base case. Do you or someone know what reasoning does it apply? Should I file a bug report?> > PS. In case you aren't aware, I recently bemoaned the state of our TRE abilities here: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140505/216092.htmlVery interesting!> NickThanks, Nicola -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140509/e56ec15d/attachment.html>