Hi,

The other day on IRC, Chandler, Hal, Pablo and I discussed recursive inlining. I've pondered a bit since then, and thought I'd get some background on the list with the aim of stimulating discussion.

Our goal is to inline recursive functions. We have several workloads with recursive calls, and inlining them to a certain degree can improve performance by up to 30%.

Inlining recursive functions is a slightly different problem to inlining non-recursive functions, in that the inlining can never terminate, so you need to choose a factor (number of times) to inline. In this respect it is similar to loop unrolling, and Chandler suggested we should model it as such.

We originally thought he meant by this that the mechanics of inlining would stay the same, but the heuristics would have to be changed. This is the approach taken by Rugina & Rinard [1], which is almost the sum of the literature on this topic I could find, which surprised me. However, the actual suggestion was to remove the recursion entirely, and replace it with a loop.

Before I get into the mechanics of this, I'd like to explain what the possible optimization opportunities are for recursive functions turned into a loop.

1. Unrolling
   - This leads to bigger basic blocks (if they can be merged) and exposes more pipeline-level parallelism.
   - CSE between iterations.
   - Reduces backbranch overhead (generally negligible impact).

2. Commoning / LICM
   - Recursion doesn't have a vehicle for moving common code out into a dominating block (compare with LICM for loops).

OK, first example:

  void ex1(i) {
    if (i == 0) return;
    a[i] = i+1;
    ex1(i-1);
  }

gets converted to:

  while (1) {
    if (i == 0) break;
    a[i] = i+1;
    i = i-1;
  }

Now, the above is the simplest case and is pure tail recursion. We can eliminate this currently with the TailRecursionElimination pass. The more complex case is where we have non-tail recursion and live variables:

  void ex2(i) {
    if (i == 0) return;
    j = a[i];
    ex2(i-1);
    b[j] = i;
  }

This requires a stack to be modelled, and could be converted to:

  cnt = 0;
  while (1) {
    if (i == 0) break;
    j = a[i];
    stack[cnt++] = j;
    stack[cnt++] = i;
    --i;
  }
  while (cnt) {
    i = stack[--cnt];
    j = stack[--cnt];
    b[j] = i;
  }

The above is as far as the discussion got on IRC. The following are my half-baked thoughts.

The above transform works because the access pattern to the stack is strictly linear: pushes are followed by pops. But I don't think this is the most interesting case for recursion in real programs. I see (qualitatively, poorly) three reasons for recursion in code that should go fast:

1. A Fibonacci function in a microbenchmark or compiler shootout. I think we can safely ignore this use case...
2. Divide and conquer algorithms: quicksort, FFT butterflies, etc.
3. Traversing a data structure, where the recursion is simply to get to the leaves, which is where all the fun stuff happens.

Points 2 and 3 share a trait: they're dividing/fanning out at each step:

  void ex3(i) {
    if (is_base(i)) {
      f(); return;
    }
    g(i);
    ex3(i / 2);
    ex3(i / 2 + 1);
  }

The access pattern of such a function is not linear. It is a pre-order walk of a (binary, in this case) tree. As such we can't use the two-loop transform above, but would have to form one loop and just manually implement the stack operations:

  cnt = 0;
  stack[cnt++] = i;
  while (cnt) {
    i = stack[--cnt];
    if (is_base(i)) {
      f(); continue;
    }
    g(i);
    stack[cnt++] = i / 2 + 1;
    stack[cnt++] = i / 2;
  }

OK, it's still a well-formed loop, and we can still do useful operations on it like LICM or, if it's profitable, unrolling.
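To make the shape of that transform concrete, here is a small compilable C sketch (is_base, f and g are invented stand-ins; f/g just record the visit) that runs the recursive ex3 and the explicit-stack version side by side and asserts they visit nodes in the same order:

  #include <assert.h>
  #include <stdio.h>

  /* Invented stand-ins. is_base(i) must be true for i <= 2, or ex3(2)
     would recurse into itself via i/2 + 1 == 2. Leaf visits are recorded
     as -i, inner visits as i, so f() and g() calls stay distinguishable. */
  static int visits_a[64], visits_b[64], na, nb;
  static int is_base(int i) { return i <= 2; }

  static void ex3_rec(int i, int *v, int *n) {
    if (is_base(i)) { v[(*n)++] = -i; return; }  /* the f() call on a leaf */
    v[(*n)++] = i;                               /* the g(i) call */
    ex3_rec(i / 2, v, n);
    ex3_rec(i / 2 + 1, v, n);
  }

  static void ex3_loop(int i, int *v, int *n) {
    int stack[64], cnt = 0;
    stack[cnt++] = i;
    while (cnt) {
      i = stack[--cnt];
      if (is_base(i)) { v[(*n)++] = -i; continue; }
      v[(*n)++] = i;
      stack[cnt++] = i / 2 + 1;   /* pushed first, so popped second */
      stack[cnt++] = i / 2;
    }
  }

  int main(void) {
    ex3_rec(10, visits_a, &na);
    ex3_loop(10, visits_b, &nb);
    assert(na == nb);
    for (int k = 0; k < na; k++) assert(visits_a[k] == visits_b[k]);
    printf("both forms visit %d nodes in the same order\n", na);
    return 0;
  }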
Now what about a more complex case, like in the FFT butterfly algorithm:

  void ex4(i) {
    if (is_base(i)) {
      f(i); return;
    }
    g(i);
    ex4(i / 2);
    ex4(i / 2 + 1);
    h(i);
  }

Here, we have to do work after the last recursive call. This presents a problem, because we don't have any of the caller's context when dealing with a node. Is a node N the first recursive call, or the second? Call-stack recursion can save two things: the state of variables, and the return address, which serves as another kind of state. The obvious answer is to turn this into a sort of state machine:

  cnt = 0;
  stack[cnt++] = i;
  stack[cnt++] = STATE_1;
  while (cnt) {
    state = stack[--cnt];
    i = stack[--cnt];

    if (state == STATE_FINISH) {
      h(i);
      continue;
    }

    if (is_base(i)) {
      f(i); continue;
    }

    g(i);
    stack[cnt++] = i;  // Push ourself again, in the final state this time
    stack[cnt++] = STATE_FINISH;
    stack[cnt++] = i / 2 + 1;
    stack[cnt++] = STATE_1;
    stack[cnt++] = i / 2;
    stack[cnt++] = STATE_1;
  }

This solution can be generalised for when there is code in between the recursive calls (just add more states). It's getting more complex in terms of control flow, though - the question is, is it worth it?

Unrolling this loop is probably not going to usefully create larger basic blocks, due to how complex the control flow is. I suppose we have the opportunity, if we unroll, of CSE and jump-threading values through equivalent blocks, which may gain us time. However, we can still do LICM, which is probably a major improvement, and there are no calls/returns to speculate.

This approach would have to be extended if the recursive calls didn't return void (for example, if it were a recursive reduction), but it's not impossible.

Sorry for the long post. What do people think?

Cheers,

James

[1]: http://link.springer.com/chapter/10.1007/3-540-45574-4_3
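For concreteness, here is a compilable C rendering of the ex4 state machine above: a sketch only, with invented stubs (f, g and h just log their calls as tag*1000 + i), so the recursive and state-machine forms can be checked against each other call for call:

  #include <assert.h>
  #include <stdio.h>

  enum { STATE_1, STATE_FINISH };

  static int log_a[64], log_b[64], na, nb;
  static int is_base(int i) { return i <= 2; }
  static void record(int *lg, int *n, int tag, int i) { lg[(*n)++] = tag * 1000 + i; }

  static void ex4_rec(int i, int *lg, int *n) {
    if (is_base(i)) { record(lg, n, 1, i); return; }  /* f(i) */
    record(lg, n, 2, i);                              /* g(i) */
    ex4_rec(i / 2, lg, n);
    ex4_rec(i / 2 + 1, lg, n);
    record(lg, n, 3, i);                              /* h(i) */
  }

  static void ex4_loop(int i, int *lg, int *n) {
    int stack[64], cnt = 0;
    stack[cnt++] = i;
    stack[cnt++] = STATE_1;
    while (cnt) {
      int state = stack[--cnt];
      i = stack[--cnt];
      if (state == STATE_FINISH) { record(lg, n, 3, i); continue; }  /* h(i) */
      if (is_base(i)) { record(lg, n, 1, i); continue; }             /* f(i) */
      record(lg, n, 2, i);                                           /* g(i) */
      stack[cnt++] = i;             /* push ourself again, in the final state */
      stack[cnt++] = STATE_FINISH;
      stack[cnt++] = i / 2 + 1;     /* popped second (LIFO)... */
      stack[cnt++] = STATE_1;
      stack[cnt++] = i / 2;         /* ...after this child finishes first */
      stack[cnt++] = STATE_1;
    }
  }

  int main(void) {
    ex4_rec(10, log_a, &na);
    ex4_loop(10, log_b, &nb);
    assert(na == nb);
    for (int k = 0; k < na; k++) assert(log_a[k] == log_b[k]);
    printf("recursive and state-machine forms agree on %d calls\n", na);
    return 0;
  }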
First off, thanks for the examples. They are very nice. Breaking it down into an explicit state machine is, I think, exactly the right way to model this.

On Thu, Feb 5, 2015 at 2:36 AM, James Molloy <james at jamesmolloy.co.uk> wrote:

> This solution can be generalised for when there is code in between the
> recursive calls (just add more states). It's getting more complex in terms
> of control flow, though - the question is, is it worth it?
>
> Unrolling this loop is probably not going to usefully create larger basic
> blocks, due to how complex the control flow is. I suppose we have the
> opportunity, if we unroll, of CSE and jump-threading values through
> equivalent blocks, which may gain us time. However, we can still do LICM,
> which is probably a major improvement, and there are no calls/returns to
> speculate.
>
> This approach would have to be extended if the recursive calls didn't
> return void (for example, if it were a recursive reduction), but it's not
> impossible.
>
> Sorry for the long post. What do people think?

Personally, I think this would be an amazing transformation for LLVM to do. But I agree that it gets very complex in the limit. I suspect there will need to be some complexity bound on doing this transformation, but as of yet I'm not sure what it would be.

Here is what is in my head. Let's imagine we do this transform, and that *no other optimizations fire*. How close can we make the generated code resemble what would be executed by actually calling the function recursively? Can we use the actual stack and lower the adjustments much as we would if we imagined the dynamic trace of instructions in the recursive variant?

My suspicion is that for a decent chunk of real-world recursion, we can transform it into a loop that is not significantly slower to execute. We can use the same stack adjustment code that would be executed in a recursive call, etc.

My further suspicion is that the complexity of control flow is not the critical concern here, but rather the amount of stack usage, and how much work we have to do to use a minimal amount of the stack. I could be completely wrong about this; however, looking at your state machine code, my first impression is that the code is actually not bad... except for the stack manipulation. I think we'll actually generate very reasonable code for the CFGs here.

So, if we can do this transform in a way that produces generated code that essentially executes the same dynamic trace of instructions that the recursive code would execute, I think we're in *very* safe territory. The primary code size cost will be switching from the call-stack approach to this manual stack approach. It'll be like the cost of peeling the outer iteration. Unless code size is of extreme concern, I don't think there is much downside.

As for the upside, I think it is huge. LICM alone is a game-changer. GVN starts to work. PRE can be applied. Will SCEV be able to reason about the trip count? Probably not. But the entire rest of the optimizer essentially gets turned *on* by this change.
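To make the LICM point concrete with a deliberately tiny, invented toy (hand-transformed; this particular function happens to be tail-recursive, so TRE alone would already convert it, but it shows what LICM buys once any recursion has become a loop):

  /* Recursive form: *scale is re-loaded on every call. There is no block
     dominating all the recursive calls for the load to be hoisted into. */
  void scale_rec(int *restrict a, int n, const int *scale) {
    if (n == 0) return;
    a[n] = a[n] * *scale;
    scale_rec(a, n - 1, scale);
  }

  /* Loop form: the load is loop-invariant, and LICM can hoist it (shown
     here by hand). restrict makes the hoist legal, since a and scale are
     then known not to alias. */
  void scale_loop(int *restrict a, int n, const int *scale) {
    int s = *scale;          /* loaded once instead of n times */
    while (n != 0) {
      a[n] = a[n] * s;
      n = n - 1;
    }
  }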
On Thu, Feb 5, 2015 at 9:06 PM, James Molloy <james at jamesmolloy.co.uk> wrote:

> Now what about a more complex case, like in the FFT butterfly algorithm:
>
>   void ex4(i) {
>     if (is_base(i)) {
>       f(i); return;
>     }
>     g(i);
>     ex4(i / 2);
>     ex4(i / 2 + 1);
>     h(i);
>   }

This sounds like a very similar process to transforming a C# IEnumerable's "yield" into a state machine. There might be some relevant literature in that area.

So essentially we just need to:
- replace calls with a stack push and a jump to the entry block;
- replace returns with a stack pop and a jump to the state following the call or return;
and try to avoid pushing any invariant values.

Something along these lines?

  struct stack {
    state;
    i;
  };

  void ex4(i) {
  entry:
    struct stack *stack = intrinsic.stackptr()  // ??
    goto state_0;

  state_0:                        // entry of each recursive "call"
    i0 = phi [i, i1, i2]
    stack0 = phi [stack, stack1, stack2]
    if (is_base(i0)) {
      f(i0);
      goto state_ret;
    }
    g(i0);
    stack0->state = 1;            // resume at state_1 when this call returns
    stack0->i = i0;
    stack1 = stack0 + 1;          // push
    i1 = i0 / 2;                  // argument of the first recursive call
    goto state_0;

  state_1:                        // after the first recursive call
    stack4->state = 2;            // reuse the popped slot; resume at state_2 next
    stack4->i = i3;
    stack2 = stack4 + 1;          // push again
    i2 = i3 / 2 + 1;              // argument of the second recursive call
    goto state_0;

  state_2:                        // after the second recursive call
    h(i3);
    goto state_ret;

  state_ret:                      // "return": pop a slot and dispatch on its state
    stack3 = phi [stack0, stack4]
    if (stack3 == stack) return;  // stack empty; the outermost call is done
    stack4 = stack3 - 1;          // pop
    i3 = stack4->i;
    switch (stack4->state) {
      case 1: goto state_1;
      case 2: goto state_2;
      default: unreachable;
    }
  }
Thinking about this some more, I realise I could explain this more clearly. I think we can manipulate the bitcode fairly easily, and in the same way for all recursive functions.

Before:

  void func(%arg1, ...) {
  entry:
    // ...
  terminate:
    // ...
    ret;
  recursive_call:
    // ...
    call func(%arg1_tmp, ...)
    // ...
  end:
    // ...
    ret;
  }

After:

  void func(%arg1_top, ...) {
  entry:
    push state_end
    br recurse;
  recurse:
    %arg1 = phi %arg1_top, %arg1_tmp, ...
    // ...
  terminate:
    // ...
    br state_ret;
  recursive_call:
    // ...
    push live values
    push state1
    br recurse;
  state1:
    pop live values
    // ...
  end:
    // ...
    br state_ret;
  state_ret:
    %state = pop state
    switch %state {
      case state_end: br finish;
      case state1: br state1;
      ... other states ...
    }
  finish:
    ret;
  }

On Thu, Feb 5, 2015 at 10:53 PM, Jeremy Lakeman <Jeremy.Lakeman at gmail.com> wrote:

> [...]
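Spelling that recipe out by hand on the earlier ex2 example, in C rather than bitcode (the explicit array, labels and gotos are my stand-ins for the pushes, pops and branches above):

  #include <assert.h>
  #include <stdio.h>

  enum { STATE_END, STATE_1 };

  static int a[16], b_rec[16], b_loop[16];

  /* The original ex2, for reference. */
  static void ex2_rec(int i) {
    if (i == 0) return;
    int j = a[i];
    ex2_rec(i - 1);
    b_rec[j] = i;
  }

  /* ex2 with the call replaced by push+branch and the returns replaced by
     pop+dispatch, following the recipe above. */
  static void ex2_loop(int i) {
    int stack[64], cnt = 0;
    stack[cnt++] = STATE_END;       /* entry: push the sentinel state */
  recurse:
    if (i == 0) goto state_ret;     /* terminate: branch to the dispatcher */
    {
      int j = a[i];
      stack[cnt++] = j;             /* recursive_call: push the live values */
      stack[cnt++] = i;
      stack[cnt++] = STATE_1;       /* ...and where to resume */
      i = i - 1;
      goto recurse;
    }
  state_1:                          /* the code that ran after the call */
    {
      int i2 = stack[--cnt];
      int j2 = stack[--cnt];
      b_loop[j2] = i2;
    }                               /* fall through: "return" again */
  state_ret:
    switch (stack[--cnt]) {
      case STATE_1: goto state_1;
      case STATE_END: return;       /* finish */
    }
  }

  int main(void) {
    for (int k = 0; k < 16; k++) a[k] = (k * 7) % 16;
    ex2_rec(10);
    ex2_loop(10);
    for (int k = 0; k < 16; k++) assert(b_rec[k] == b_loop[k]);
    printf("recursive and rewritten ex2 agree\n");
    return 0;
  }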
----- Original Message -----
> From: "James Molloy" <james at jamesmolloy.co.uk>
> To: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc at google.com>, "Hal Finkel" <hfinkel at anl.gov>, "pablo barrio" <pablo.barrio at arm.com>
> Sent: Thursday, February 5, 2015 4:36:16 AM
> Subject: RFC: Recursive inlining
>
> [...]
> The obvious answer is to turn this into a sort of state machine:
>
>   cnt = 0;
>   stack[cnt++] = i;
>   stack[cnt++] = STATE_1;
>   while (cnt) {
>     state = stack[--cnt];
>     i = stack[--cnt];
>
>     if (state == STATE_FINISH) {
>       h(i);
>       continue;
>     }
>
>     if (is_base(i)) {
>       f(i); continue;
>     }
>
>     g(i);
>     stack[cnt++] = i;  // Push ourself again, in the final state this time
>     stack[cnt++] = STATE_FINISH;
>     stack[cnt++] = i / 2 + 1;
>     stack[cnt++] = STATE_1;
>     stack[cnt++] = i / 2;
>     stack[cnt++] = STATE_1;
>   }

As we had briefly mentioned on IRC, one way of forming this 'stack', and its associated 'cnt' variable, is to use dynamic stack allocation. You can do this in general at the IR level (using allocas that just happen not to be in the entry block). If each stack block holds a pointer to the previous one (as a linked list), then you just need to maintain a pointer to the current block, and you don't need a separate 'cnt' variable (you might also keep a pointer to the next block for allocation reuse). Is that what you had in mind?

 -Hal

> [...]

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
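If I've understood the suggestion, the shape would be something like the following C sketch (field names and the chunk size are invented; malloc stands in for the non-entry-block allocas, which portable C can't express from inside a helper function):

  #include <stdlib.h>

  enum { BLOCK_SLOTS = 8 };          /* invented chunk size */

  struct block {
    struct block *prev;              /* back link, replacing 'cnt' */
    struct block *next;              /* kept around for allocation reuse */
    int slot[BLOCK_SLOTS];
  };

  struct stack {
    struct block head;               /* first block, allocated with the stack */
    struct block *cur;               /* current block */
    int top;                         /* next free slot within cur */
  };

  static void stack_init(struct stack *s) {
    s->head.prev = s->head.next = NULL;
    s->cur = &s->head;
    s->top = 0;
  }

  static void push(struct stack *s, int v) {
    if (s->top == BLOCK_SLOTS) {     /* current block full: advance the chain */
      if (!s->cur->next) {           /* no spare block yet: grab a new one */
        struct block *b = malloc(sizeof *b);
        b->prev = s->cur;
        b->next = NULL;
        s->cur->next = b;
      }
      s->cur = s->cur->next;         /* otherwise reuse the existing spare */
      s->top = 0;
    }
    s->cur->slot[s->top++] = v;
  }

  /* Callers must test empty() first, just as the loops test 'cnt'. */
  static int pop(struct stack *s) {
    if (s->top == 0) {               /* block drained: follow the back link */
      s->cur = s->cur->prev;
      s->top = BLOCK_SLOTS;
    }
    return s->cur->slot[--s->top];
  }

  static int empty(struct stack *s) {
    return s->cur == &s->head && s->top == 0;
  }

The ex4 loop would then test empty() where it currently tests cnt, and call push/pop in place of the stack[cnt++] / stack[--cnt] accesses; one link pointer per block of entries, rather than per entry, is what keeps the overhead low.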
Hi Hal,

> As we had briefly mentioned on IRC, one way of forming this 'stack', and
> its associated 'cnt' variable, is to use dynamic stack allocation.

I hadn't really reached a decision on the mechanics yet. However, while your suggestion can work within the current abilities of the IR, it has the disadvantage of using an extra stack slot for the link pointer.

I think if we are going to apply this optimization, we must make sure we aren't going to blow the stack. As users may already be just close enough to the max-stack boundary that their applications don't crash, I think we should aim as much as possible to keep the stack usage no higher than that of the recursive equivalent.

Holding the "state" is the equivalent of a return address, so adding a stack link pointer would be the equivalent of a frame address, I suppose... That could be acceptable?

As I say, I haven't thought about the deep mechanics just yet.

James

On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel <hfinkel at anl.gov> wrote:

> [...]
> As we had briefly mentioned on IRC, one way of forming this 'stack', and
> its associated 'cnt' variable, is to use dynamic stack allocation. You can
> do this in general at the IR level (using allocas that just happen not to
> be in the entry block).
> If each stack block holds a pointer to the previous
> one (as a linked list), then you just need to maintain a pointer to the
> current block, and you don't need a separate 'cnt' variable (you might also
> keep a pointer to the next block for allocation reuse). Is that what you
> had in mind?
>
> -Hal