thr3ads.net - llvm dev - [LLVMdev] RFC: Recursive inlining [Feb 2015]

If this information is useful, please help other people find it:
Share via:

James Molloy

2015-Feb-05 15:18 UTC

[LLVMdev] RFC: Recursive inlining

Hi Hal,
> As we had briefly mentioned on IRC, one way of forming this
'stack', andits associated 'cnt' variable, is to use dynamic stack allocation.

I hadn't really reached a decision on the mechanics yet. However, your
suggestion while it can work within the current abilities of the IR, has
the disadvantage that it is using an extra stack slot for the link pointer.
I think if we are going to apply this optimization, we must make sure we
aren't going to blow the stack.

As users may already be just close enough to the max-stack boundary that
their applications don't crash, I think we should aim as much as possible
to keep the stack usage no higher than the recursive equivalent. Holding
the "state" is the equivalent of a return address, so adding a stack
link
pointer would be the equivalent of a frame address I suppose... That could
be acceptable?

As I say, I haven't thought about the deep mechanics just yet.

James

On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
> > From: "James Molloy" <james at jamesmolloy.co.uk>
> > To: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc
at google.com>,
> "Hal Finkel" <hfinkel at anl.gov>, "pablo barrio"
> > <pablo.barrio at arm.com>
> > Sent: Thursday, February 5, 2015 4:36:16 AM
> > Subject: RFC: Recursive inlining
> >
> >
> > Hi,
> >
> >
> > The other day on IRC myself, Chandler, Hal and Pablo discussed
> > recursive inlining. I've pondered a bit since then, and thought
I'd
> > get some background on the list with the aim of stimulating
> > discussion.
> >
> >
> > Our goal is to inline recursive functions. We have several workloads
> > with recursive calls, and inlining them to a certain degree can
> > improve performance by up to 30%.
> >
> >
> > Inlining recursive functions is a slightly different problem to
> > inlining non-recursive functions in that the inlining can never
> > terminate, so you need to choose a factor (number of times) to
> > inline. In this respect it is similar to loop unrolling, and
> > Chandler suggested we should model it as such.
> >
> >
> > We originally thought he meant by this that the mechanics of inlining
> > would stay the same, but the heuristics would have to be changed.
> > This is the approach taken by Rugina & Rinard [1], which is almost
> > the sum of the literature on this topic I could find, which
> > surprised me. However, the actual suggestion was to remove the
> > recursion entirely, and replace it with a loop.
> >
> >
> > Before I get into the mechanics of this, I'd like to explain what
the
> > possible optimization opportunities are for recursive functions
> > turned into a loop.
> >
> >
> > 1. Unrolling
> > - This leads to bigger basic blocks (if they can be merged) and
> > exposes more pipeline-level parallelism.
> > - CSE between iterations.
> > - Reduces backbranch overhead (generally negligible impact).
> > 2. Commoning / LICM
> > - Recursion doesn't have a vehicle for moving common code out into
a
> > dominating block (compare with LICM for loops).
> >
> >
> > OK, first example:
> >
> >
> > void ex1(i) {
> > if (i == 0) return;
> > a[i] = i+1;
> > ex1(i-1);
> > }
> >
> >
> > gets converted to:
> >
> >
> > while (1) {
> > if (i == 0) break;
> > a[i] = i+1;
> > i = i-1;
> > }
> >
> >
> > Now, the above is the simplest case and is pure tail recursion. We
> > can eliminate this currently with the TailRecursionElimination pass.
> > The more complex case is where we have non-tail recusion and live
> > variables.
> >
> >
> > void ex2(i) {
> > if (i == 0) return;
> > j = a[i];
> > ex2(i-1)
> > b[j] = i;
> > }
> >
> >
> > This requires a stack being modelled and could be converted to:
> >
> >
> > cnt = 0;
> > while (1) {
> > if (i == 0) break;
> > j = a[i];
> > stack[cnt++] = j;
> > stack[cnt++] = i;
> > ++cnt;
> > --i;
> > }
> > while (cnt) {
> > i = stack[--cnt];
> > j = stack[--cnt];
> > b[j] = i;
> > }
> >
> >
> > The above is as far as the discussion got on IRC. The following are
> > my half-baked thoughts.
> >
> >
> > The above transform works because the access pattern to the stack is
> > strictly linear. Pushes are followed by pops. But I don't think
this
> > is the most interesting case for recursion in real programs. I see
> > (qualitatively, poorly) 3 reasons for recursion in code that should
> > go fast:
> >
> >
> > 1. A fibonnacci function in a microbenchmark or compiler shootout. I
> > think we can safely ignore this usecase...
> > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc.
> > 3. Traversing a data structure, where the recursion is simply to get
> > to the leaves which is where all the fun stuff happens.
> >
> >
> > Points 2 and 3 both share a trait that they're dividing/fanning
out
> > at each step:
> >
> >
> > void ex3(i) {
> > if (is_base(i)) {
> > f();
> > return;
> > }
> > g(i);
> > ex3(i / 2);
> > ex3(i / 2 + 1);
> > }
> >
> >
> > The access pattern of such a function is not linear. It is a
> > pre-order walk of a (binary, in this case) tree. As such we can't
> > use the two-loop transform above, but would have to form one loop
> > and just manually implement the stack operations:
> >
> >
> > cnt = 0
> > stack[cnt++] = i;
> > while (1) {
> > i = stack[--cnt];
> > if (is_base(i)) {
> > f(); continue;
> > }
> > g(i);
> > stack[cnt++] = i / 2 + 1;
> > stack[cnt++] = i / 2;
> > }
> >
> >
> > OK, it's still a well-formed loop, we can still do useful
operations
> > on it like LICM or, if it's profitable, unrolling.
> >
> >
> > Now what about a more complex case, like in the FFT butterfly
> > algorithm:
> >
> >
> > void ex4(i) {
> > if (is_base(i)) {
> > f(i); return;
> > }
> > g(i);
> > ex4(i / 2);
> > ex4(i / 2 + 1);
> > h(i);
> > }
> >
> >
> > Here, we have to do work after the last recursive call. This presents
> > a problem, because we don't have any of the caller's context
when
> > dealing with a node. Is a node N the first recursive call? or the
> > second? Call-stack recursion can save two things: the state of
> > variables and the return address which serves as another kind of
> > state. The obvious answer is to turn this into a sort of state
> > machine:
> >
> >
> > cnt = 0
> > stack[cnt++] = i;
> > stack[cnt++] = STATE_1;
> > while (1) {
> > state = stack[--cnt];
> > i = stack[--cnt];
> >
> >
> > if (state == STATE_FINISH) {
> > h(i);
> > continue;
> > }
> >
> >
> > if (is_base(i)) {
> > f(); continue;
> > }
> >
> >
> > g(i);
> > stack[cnt++] = i; // Push ourself again, in the final state this time
> > stack[cnt++] = STATE_FINISH;
> > stack[cnt++] = i / 2 + 1;
> > stack[cnt++] = STATE_1;
> > stack[cnt++] = i / 2;
> > stack[cnt++] = STATE_1;
> > }
>
> As we had briefly mentioned on IRC, one way of forming this
'stack', and
> its associated 'cnt' variable, is to use dynamic stack allocation.
You can
> do this in general at the IR level (using allocas that just happen not to
> be in the entry block). If each stack block holds a pointer to the previous
> one (as a linked list), then you just need to maintain a pointer to the
> current block, and you don't need a separate 'cnt' variable
(you might also
> keep a pointer to the next block for allocation reuse). Is that what you
> had in mind?
>
>  -Hal
>
> >
> >
> > This solution can be generalised for if there is code in between the
> > recursive calls (just add more states). It's getting more complex
in
> > terms of control flow though - the question is, is it worth it?
> >
> >
> > Unrolling this loop is probably not going to usefully create larger
> > basic blocks due to how complex the control flow is. I suppose we
> > have the opportunity, if we unroll, of CSE and jump threading values
> > through equivalent blocks that may gain us time. However, we can
> > still do LICM which is probably a major improvement, and there are
> > no calls/returns to speculate.
> >
> >
> > This approach would have to be extended if the recursive calls
didn't
> > return void (for example it was a recursive reduction), but it's
not
> > impossible.
> >
> >
> > Sorry for the long post. What do people think?
> >
> >
> > Cheers,
> >
> >
> > James
> >
> >
> > [1]: http://link.springer.com/chapter/10.1007/3-540-45574-4_3
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150205/596b77a7/attachment.html>

Pablo Barrio

2015-Feb-05 16:18 UTC

head link

[LLVMdev] RFC: Recursive inlining

Hi all,

 

This thread is really interesting. I’m just trying to understand here, so please
correct me if I’m wrong.

 

Wouldn’t we potentially save stack usage just by turning the recursion into
loops? For example, at the assembly level, the ABI often requires to preserve
some registers between calls which translate into stack pushes, and we will get
rid of that. My understanding is that this transform does the same thing as the
lowering to assembly does for function calls, with the exception that all the
architecture-independent passes will have a broader context for analysis. So I
would expect to even improve the stack usage.

 

Otherwise, we could analyze the expected stack usage for a single recursive
call. Then, compare it to the expected usage after the transformation for each
loop iteration. This could be used as a heuristic to decide whether to do the
conversion or not? E.g. we only allow to increase the stack growth by 1.2x.

 

Thanks,

Pablo

 

From: James Molloy [mailto:james at jamesmolloy.co.uk] 
Sent: 05 February 2015 15:19
To: Hal Finkel
Cc: llvmdev at cs.uiuc.edu; Chandler Carruth; Pablo Barrio
Subject: Re: RFC: Recursive inlining

 

Hi Hal,

 
> As we had briefly mentioned on IRC, one way of forming this
'stack', and its associated 'cnt' variable, is to use dynamic
stack allocation.
 

I hadn't really reached a decision on the mechanics yet. However, your
suggestion while it can work within the current abilities of the IR, has the
disadvantage that it is using an extra stack slot for the link pointer. I think
if we are going to apply this optimization, we must make sure we aren't
going to blow the stack.

 

As users may already be just close enough to the max-stack boundary that their
applications don't crash, I think we should aim as much as possible to keep
the stack usage no higher than the recursive equivalent. Holding the
"state" is the equivalent of a return address, so adding a stack link
pointer would be the equivalent of a frame address I suppose... That could be
acceptable?

 

As I say, I haven't thought about the deep mechanics just yet.

 

James

 

On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel <hfinkel at anl.gov> wrote:

----- Original Message -----> From: "James Molloy" <james at jamesmolloy.co.uk>
> To: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc at
google.com>, "Hal Finkel" <hfinkel at anl.gov>, "pablo
barrio"
> <pablo.barrio at arm.com>
> Sent: Thursday, February 5, 2015 4:36:16 AM
> Subject: RFC: Recursive inlining
>
>
> Hi,
>
>
> The other day on IRC myself, Chandler, Hal and Pablo discussed
> recursive inlining. I've pondered a bit since then, and thought I'd
> get some background on the list with the aim of stimulating
> discussion.
>
>
> Our goal is to inline recursive functions. We have several workloads
> with recursive calls, and inlining them to a certain degree can
> improve performance by up to 30%.
>
>
> Inlining recursive functions is a slightly different problem to
> inlining non-recursive functions in that the inlining can never
> terminate, so you need to choose a factor (number of times) to
> inline. In this respect it is similar to loop unrolling, and
> Chandler suggested we should model it as such.
>
>
> We originally thought he meant by this that the mechanics of inlining
> would stay the same, but the heuristics would have to be changed.
> This is the approach taken by Rugina & Rinard [1], which is almost
> the sum of the literature on this topic I could find, which
> surprised me. However, the actual suggestion was to remove the
> recursion entirely, and replace it with a loop.
>
>
> Before I get into the mechanics of this, I'd like to explain what the
> possible optimization opportunities are for recursive functions
> turned into a loop.
>
>
> 1. Unrolling
> - This leads to bigger basic blocks (if they can be merged) and
> exposes more pipeline-level parallelism.
> - CSE between iterations.
> - Reduces backbranch overhead (generally negligible impact).
> 2. Commoning / LICM
> - Recursion doesn't have a vehicle for moving common code out into a
> dominating block (compare with LICM for loops).
>
>
> OK, first example:
>
>
> void ex1(i) {
> if (i == 0) return;
> a[i] = i+1;
> ex1(i-1);
> }
>
>
> gets converted to:
>
>
> while (1) {
> if (i == 0) break;
> a[i] = i+1;
> i = i-1;
> }
>
>
> Now, the above is the simplest case and is pure tail recursion. We
> can eliminate this currently with the TailRecursionElimination pass.
> The more complex case is where we have non-tail recusion and live
> variables.
>
>
> void ex2(i) {
> if (i == 0) return;
> j = a[i];
> ex2(i-1)
> b[j] = i;
> }
>
>
> This requires a stack being modelled and could be converted to:
>
>
> cnt = 0;
> while (1) {
> if (i == 0) break;
> j = a[i];
> stack[cnt++] = j;
> stack[cnt++] = i;
> ++cnt;
> --i;
> }
> while (cnt) {
> i = stack[--cnt];
> j = stack[--cnt];
> b[j] = i;
> }
>
>
> The above is as far as the discussion got on IRC. The following are
> my half-baked thoughts.
>
>
> The above transform works because the access pattern to the stack is
> strictly linear. Pushes are followed by pops. But I don't think this
> is the most interesting case for recursion in real programs. I see
> (qualitatively, poorly) 3 reasons for recursion in code that should
> go fast:
>
>
> 1. A fibonnacci function in a microbenchmark or compiler shootout. I
> think we can safely ignore this usecase...
> 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc.
> 3. Traversing a data structure, where the recursion is simply to get
> to the leaves which is where all the fun stuff happens.
>
>
> Points 2 and 3 both share a trait that they're dividing/fanning out
> at each step:
>
>
> void ex3(i) {
> if (is_base(i)) {
> f();
> return;
> }
> g(i);
> ex3(i / 2);
> ex3(i / 2 + 1);
> }
>
>
> The access pattern of such a function is not linear. It is a
> pre-order walk of a (binary, in this case) tree. As such we can't
> use the two-loop transform above, but would have to form one loop
> and just manually implement the stack operations:
>
>
> cnt = 0
> stack[cnt++] = i;
> while (1) {
> i = stack[--cnt];
> if (is_base(i)) {
> f(); continue;
> }
> g(i);
> stack[cnt++] = i / 2 + 1;
> stack[cnt++] = i / 2;
> }
>
>
> OK, it's still a well-formed loop, we can still do useful operations
> on it like LICM or, if it's profitable, unrolling.
>
>
> Now what about a more complex case, like in the FFT butterfly
> algorithm:
>
>
> void ex4(i) {
> if (is_base(i)) {
> f(i); return;
> }
> g(i);
> ex4(i / 2);
> ex4(i / 2 + 1);
> h(i);
> }
>
>
> Here, we have to do work after the last recursive call. This presents
> a problem, because we don't have any of the caller's context when
> dealing with a node. Is a node N the first recursive call? or the
> second? Call-stack recursion can save two things: the state of
> variables and the return address which serves as another kind of
> state. The obvious answer is to turn this into a sort of state
> machine:
>
>
> cnt = 0
> stack[cnt++] = i;
> stack[cnt++] = STATE_1;
> while (1) {
> state = stack[--cnt];
> i = stack[--cnt];
>
>
> if (state == STATE_FINISH) {
> h(i);
> continue;
> }
>
>
> if (is_base(i)) {
> f(); continue;
> }
>
>
> g(i);
> stack[cnt++] = i; // Push ourself again, in the final state this time
> stack[cnt++] = STATE_FINISH;
> stack[cnt++] = i / 2 + 1;
> stack[cnt++] = STATE_1;
> stack[cnt++] = i / 2;
> stack[cnt++] = STATE_1;
> }
As we had briefly mentioned on IRC, one way of forming this 'stack', and
its associated 'cnt' variable, is to use dynamic stack allocation. You
can do this in general at the IR level (using allocas that just happen not to be
in the entry block). If each stack block holds a pointer to the previous one (as
a linked list), then you just need to maintain a pointer to the current block,
and you don't need a separate 'cnt' variable (you might also keep a
pointer to the next block for allocation reuse). Is that what you had in mind?

 -Hal
>
>
> This solution can be generalised for if there is code in between the
> recursive calls (just add more states). It's getting more complex in
> terms of control flow though - the question is, is it worth it?
>
>
> Unrolling this loop is probably not going to usefully create larger
> basic blocks due to how complex the control flow is. I suppose we
> have the opportunity, if we unroll, of CSE and jump threading values
> through equivalent blocks that may gain us time. However, we can
> still do LICM which is probably a major improvement, and there are
> no calls/returns to speculate.
>
>
> This approach would have to be extended if the recursive calls didn't
> return void (for example it was a recursive reduction), but it's not
> impossible.
>
>
> Sorry for the long post. What do people think?
>
>
> Cheers,
>
>
> James
>
>
> [1]: http://link.springer.com/chapter/10.1007/3-540-45574-4_3
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150205/fb4a76d2/attachment.html>

mats petersson

2015-Feb-05 17:00 UTC

head link

[LLVMdev] RFC: Recursive inlining

Pablo, Most of the time the ABI only requires saving of registers that
are actually used in the function. So there may be cases where those
will not need saving, but at the same time, chances are that the loop
version will have to save/restore approximately the same amount of
context - and the caller-save registers [including return address] are
part of the callee context in recursion.

But yes, there is a small chance that the loop version saves on
"save/restore register on stack".

If the stack grows more than in actual recursion, then it's probably
questionable if the inlining/unwinding of the recursion is worth doing
- since most of the purpose here is to avoid the call/return and
argument passing - but if the code ends up saving/restoring MORE data,
it's unlikely to be of benefit [in my limited understanding of the
problem].

--
Mats

On 5 February 2015 at 16:18, Pablo Barrio <pablo.barrio at arm.com>
wrote:> Hi all,
>
>
>
> This thread is really interesting. I’m just trying to understand here, so
> please correct me if I’m wrong.
>
>
>
> Wouldn’t we potentially save stack usage just by turning the recursion into
> loops? For example, at the assembly level, the ABI often requires to
> preserve some registers between calls which translate into stack pushes,
and
> we will get rid of that. My understanding is that this transform does the
> same thing as the lowering to assembly does for function calls, with the
> exception that all the architecture-independent passes will have a broader
> context for analysis. So I would expect to even improve the stack usage.
>
>
>
> Otherwise, we could analyze the expected stack usage for a single recursive
> call. Then, compare it to the expected usage after the transformation for
> each loop iteration. This could be used as a heuristic to decide whether to
> do the conversion or not? E.g. we only allow to increase the stack growth
by
> 1.2x.
>
>
>
> Thanks,
>
> Pablo
>
>
>
> From: James Molloy [mailto:james at jamesmolloy.co.uk]
> Sent: 05 February 2015 15:19
> To: Hal Finkel
> Cc: llvmdev at cs.uiuc.edu; Chandler Carruth; Pablo Barrio
> Subject: Re: RFC: Recursive inlining
>
>
>
> Hi Hal,
>
>
>
>> As we had briefly mentioned on IRC, one way of forming this
'stack', and
>> its associated 'cnt' variable, is to use dynamic stack
allocation.
>
>
>
> I hadn't really reached a decision on the mechanics yet. However, your
> suggestion while it can work within the current abilities of the IR, has
the
> disadvantage that it is using an extra stack slot for the link pointer. I
> think if we are going to apply this optimization, we must make sure we
> aren't going to blow the stack.
>
>
>
> As users may already be just close enough to the max-stack boundary that
> their applications don't crash, I think we should aim as much as
possible to
> keep the stack usage no higher than the recursive equivalent. Holding the
> "state" is the equivalent of a return address, so adding a stack
link
> pointer would be the equivalent of a frame address I suppose... That could
> be acceptable?
>
>
>
> As I say, I haven't thought about the deep mechanics just yet.
>
>
>
> James
>
>
>
> On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel <hfinkel at anl.gov>
wrote:
>
> ----- Original Message -----
>> From: "James Molloy" <james at jamesmolloy.co.uk>
>> To: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc
at google.com>, "Hal
>> Finkel" <hfinkel at anl.gov>, "pablo barrio"
>> <pablo.barrio at arm.com>
>> Sent: Thursday, February 5, 2015 4:36:16 AM
>> Subject: RFC: Recursive inlining
>>
>>
>> Hi,
>>
>>
>> The other day on IRC myself, Chandler, Hal and Pablo discussed
>> recursive inlining. I've pondered a bit since then, and thought
I'd
>> get some background on the list with the aim of stimulating
>> discussion.
>>
>>
>> Our goal is to inline recursive functions. We have several workloads
>> with recursive calls, and inlining them to a certain degree can
>> improve performance by up to 30%.
>>
>>
>> Inlining recursive functions is a slightly different problem to
>> inlining non-recursive functions in that the inlining can never
>> terminate, so you need to choose a factor (number of times) to
>> inline. In this respect it is similar to loop unrolling, and
>> Chandler suggested we should model it as such.
>>
>>
>> We originally thought he meant by this that the mechanics of inlining
>> would stay the same, but the heuristics would have to be changed.
>> This is the approach taken by Rugina & Rinard [1], which is almost
>> the sum of the literature on this topic I could find, which
>> surprised me. However, the actual suggestion was to remove the
>> recursion entirely, and replace it with a loop.
>>
>>
>> Before I get into the mechanics of this, I'd like to explain what
the
>> possible optimization opportunities are for recursive functions
>> turned into a loop.
>>
>>
>> 1. Unrolling
>> - This leads to bigger basic blocks (if they can be merged) and
>> exposes more pipeline-level parallelism.
>> - CSE between iterations.
>> - Reduces backbranch overhead (generally negligible impact).
>> 2. Commoning / LICM
>> - Recursion doesn't have a vehicle for moving common code out into
a
>> dominating block (compare with LICM for loops).
>>
>>
>> OK, first example:
>>
>>
>> void ex1(i) {
>> if (i == 0) return;
>> a[i] = i+1;
>> ex1(i-1);
>> }
>>
>>
>> gets converted to:
>>
>>
>> while (1) {
>> if (i == 0) break;
>> a[i] = i+1;
>> i = i-1;
>> }
>>
>>
>> Now, the above is the simplest case and is pure tail recursion. We
>> can eliminate this currently with the TailRecursionElimination pass.
>> The more complex case is where we have non-tail recusion and live
>> variables.
>>
>>
>> void ex2(i) {
>> if (i == 0) return;
>> j = a[i];
>> ex2(i-1)
>> b[j] = i;
>> }
>>
>>
>> This requires a stack being modelled and could be converted to:
>>
>>
>> cnt = 0;
>> while (1) {
>> if (i == 0) break;
>> j = a[i];
>> stack[cnt++] = j;
>> stack[cnt++] = i;
>> ++cnt;
>> --i;
>> }
>> while (cnt) {
>> i = stack[--cnt];
>> j = stack[--cnt];
>> b[j] = i;
>> }
>>
>>
>> The above is as far as the discussion got on IRC. The following are
>> my half-baked thoughts.
>>
>>
>> The above transform works because the access pattern to the stack is
>> strictly linear. Pushes are followed by pops. But I don't think
this
>> is the most interesting case for recursion in real programs. I see
>> (qualitatively, poorly) 3 reasons for recursion in code that should
>> go fast:
>>
>>
>> 1. A fibonnacci function in a microbenchmark or compiler shootout. I
>> think we can safely ignore this usecase...
>> 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc.
>> 3. Traversing a data structure, where the recursion is simply to get
>> to the leaves which is where all the fun stuff happens.
>>
>>
>> Points 2 and 3 both share a trait that they're dividing/fanning out
>> at each step:
>>
>>
>> void ex3(i) {
>> if (is_base(i)) {
>> f();
>> return;
>> }
>> g(i);
>> ex3(i / 2);
>> ex3(i / 2 + 1);
>> }
>>
>>
>> The access pattern of such a function is not linear. It is a
>> pre-order walk of a (binary, in this case) tree. As such we can't
>> use the two-loop transform above, but would have to form one loop
>> and just manually implement the stack operations:
>>
>>
>> cnt = 0
>> stack[cnt++] = i;
>> while (1) {
>> i = stack[--cnt];
>> if (is_base(i)) {
>> f(); continue;
>> }
>> g(i);
>> stack[cnt++] = i / 2 + 1;
>> stack[cnt++] = i / 2;
>> }
>>
>>
>> OK, it's still a well-formed loop, we can still do useful
operations
>> on it like LICM or, if it's profitable, unrolling.
>>
>>
>> Now what about a more complex case, like in the FFT butterfly
>> algorithm:
>>
>>
>> void ex4(i) {
>> if (is_base(i)) {
>> f(i); return;
>> }
>> g(i);
>> ex4(i / 2);
>> ex4(i / 2 + 1);
>> h(i);
>> }
>>
>>
>> Here, we have to do work after the last recursive call. This presents
>> a problem, because we don't have any of the caller's context
when
>> dealing with a node. Is a node N the first recursive call? or the
>> second? Call-stack recursion can save two things: the state of
>> variables and the return address which serves as another kind of
>> state. The obvious answer is to turn this into a sort of state
>> machine:
>>
>>
>> cnt = 0
>> stack[cnt++] = i;
>> stack[cnt++] = STATE_1;
>> while (1) {
>> state = stack[--cnt];
>> i = stack[--cnt];
>>
>>
>> if (state == STATE_FINISH) {
>> h(i);
>> continue;
>> }
>>
>>
>> if (is_base(i)) {
>> f(); continue;
>> }
>>
>>
>> g(i);
>> stack[cnt++] = i; // Push ourself again, in the final state this time
>> stack[cnt++] = STATE_FINISH;
>> stack[cnt++] = i / 2 + 1;
>> stack[cnt++] = STATE_1;
>> stack[cnt++] = i / 2;
>> stack[cnt++] = STATE_1;
>> }
>
> As we had briefly mentioned on IRC, one way of forming this
'stack', and its
> associated 'cnt' variable, is to use dynamic stack allocation. You
can do
> this in general at the IR level (using allocas that just happen not to be
in
> the entry block). If each stack block holds a pointer to the previous one
> (as a linked list), then you just need to maintain a pointer to the current
> block, and you don't need a separate 'cnt' variable (you might
also keep a
> pointer to the next block for allocation reuse). Is that what you had in
> mind?
>
>  -Hal
>
>>
>>
>> This solution can be generalised for if there is code in between the
>> recursive calls (just add more states). It's getting more complex
in
>> terms of control flow though - the question is, is it worth it?
>>
>>
>> Unrolling this loop is probably not going to usefully create larger
>> basic blocks due to how complex the control flow is. I suppose we
>> have the opportunity, if we unroll, of CSE and jump threading values
>> through equivalent blocks that may gain us time. However, we can
>> still do LICM which is probably a major improvement, and there are
>> no calls/returns to speculate.
>>
>>
>> This approach would have to be extended if the recursive calls
didn't
>> return void (for example it was a recursive reduction), but it's
not
>> impossible.
>>
>>
>> Sorry for the long post. What do people think?
>>
>>
>> Cheers,
>>
>>
>> James
>>
>>
>> [1]: http://link.springer.com/chapter/10.1007/3-540-45574-4_3
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

Hal Finkel

2015-Feb-06 02:34 UTC

head link

[LLVMdev] RFC: Recursive inlining

----- Original Message -----> From: "James Molloy" <james at jamesmolloy.co.uk>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc at
google.com>, "pablo barrio" <pablo.barrio at arm.com>
> Sent: Thursday, February 5, 2015 9:18:35 AM
> Subject: Re: RFC: Recursive inlining
> 
> 
> Hi Hal,
> 
> > As we had briefly mentioned on IRC, one way of forming this
> > 'stack', and its associated 'cnt' variable, is to use
dynamic
> > stack allocation.
> 
> 
> 
> I hadn't really reached a decision on the mechanics yet. However,
> your suggestion while it can work within the current abilities of
> the IR, has the disadvantage that it is using an extra stack slot
> for the link pointer. I think if we are going to apply this
> optimization, we must make sure we aren't going to blow the stack.
> 
> 
> As users may already be just close enough to the max-stack boundary
> that their applications don't crash, I think we should aim as much
> as possible to keep the stack usage no higher than the recursive
> equivalent. Holding the "state" is the equivalent of a return
> address, so adding a stack link pointer would be the equivalent of a
> frame address I suppose... That could be acceptable?
> 
I agree. We should aim not to increase the amount of stack space used. This will
require some target knowledge (because dynamic stack allocation has some
target-dependent overhead because of alignment and other ABI requirements).

 -Hal
> 
> As I say, I haven't thought about the deep mechanics just yet.
> 
> 
> James
> 
> On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> ----- Original Message -----
> > From: "James Molloy" < james at jamesmolloy.co.uk >
> > To: llvmdev at cs.uiuc.edu , "Chandler Carruth" <
chandlerc at google.com
> > >, "Hal Finkel" < hfinkel at anl.gov >, "pablo
barrio"
> > < pablo.barrio at arm.com >
> > Sent: Thursday, February 5, 2015 4:36:16 AM
> > Subject: RFC: Recursive inlining
> > 
> > 
> > Hi,
> > 
> > 
> > The other day on IRC myself, Chandler, Hal and Pablo discussed
> > recursive inlining. I've pondered a bit since then, and thought
I'd
> > get some background on the list with the aim of stimulating
> > discussion.
> > 
> > 
> > Our goal is to inline recursive functions. We have several
> > workloads
> > with recursive calls, and inlining them to a certain degree can
> > improve performance by up to 30%.
> > 
> > 
> > Inlining recursive functions is a slightly different problem to
> > inlining non-recursive functions in that the inlining can never
> > terminate, so you need to choose a factor (number of times) to
> > inline. In this respect it is similar to loop unrolling, and
> > Chandler suggested we should model it as such.
> > 
> > 
> > We originally thought he meant by this that the mechanics of
> > inlining
> > would stay the same, but the heuristics would have to be changed.
> > This is the approach taken by Rugina & Rinard [1], which is almost
> > the sum of the literature on this topic I could find, which
> > surprised me. However, the actual suggestion was to remove the
> > recursion entirely, and replace it with a loop.
> > 
> > 
> > Before I get into the mechanics of this, I'd like to explain what
> > the
> > possible optimization opportunities are for recursive functions
> > turned into a loop.
> > 
> > 
> > 1. Unrolling
> > - This leads to bigger basic blocks (if they can be merged) and
> > exposes more pipeline-level parallelism.
> > - CSE between iterations.
> > - Reduces backbranch overhead (generally negligible impact).
> > 2. Commoning / LICM
> > - Recursion doesn't have a vehicle for moving common code out into
> > a
> > dominating block (compare with LICM for loops).
> > 
> > 
> > OK, first example:
> > 
> > 
> > void ex1(i) {
> > if (i == 0) return;
> > a[i] = i+1;
> > ex1(i-1);
> > }
> > 
> > 
> > gets converted to:
> > 
> > 
> > while (1) {
> > if (i == 0) break;
> > a[i] = i+1;
> > i = i-1;
> > }
> > 
> > 
> > Now, the above is the simplest case and is pure tail recursion. We
> > can eliminate this currently with the TailRecursionElimination
> > pass.
> > The more complex case is where we have non-tail recusion and live
> > variables.
> > 
> > 
> > void ex2(i) {
> > if (i == 0) return;
> > j = a[i];
> > ex2(i-1)
> > b[j] = i;
> > }
> > 
> > 
> > This requires a stack being modelled and could be converted to:
> > 
> > 
> > cnt = 0;
> > while (1) {
> > if (i == 0) break;
> > j = a[i];
> > stack[cnt++] = j;
> > stack[cnt++] = i;
> > ++cnt;
> > --i;
> > }
> > while (cnt) {
> > i = stack[--cnt];
> > j = stack[--cnt];
> > b[j] = i;
> > }
> > 
> > 
> > The above is as far as the discussion got on IRC. The following are
> > my half-baked thoughts.
> > 
> > 
> > The above transform works because the access pattern to the stack
> > is
> > strictly linear. Pushes are followed by pops. But I don't think
> > this
> > is the most interesting case for recursion in real programs. I see
> > (qualitatively, poorly) 3 reasons for recursion in code that should
> > go fast:
> > 
> > 
> > 1. A fibonnacci function in a microbenchmark or compiler shootout.
> > I
> > think we can safely ignore this usecase...
> > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc.
> > 3. Traversing a data structure, where the recursion is simply to
> > get
> > to the leaves which is where all the fun stuff happens.
> > 
> > 
> > Points 2 and 3 both share a trait that they're dividing/fanning
out
> > at each step:
> > 
> > 
> > void ex3(i) {
> > if (is_base(i)) {
> > f();
> > return;
> > }
> > g(i);
> > ex3(i / 2);
> > ex3(i / 2 + 1);
> > }
> > 
> > 
> > The access pattern of such a function is not linear. It is a
> > pre-order walk of a (binary, in this case) tree. As such we can't
> > use the two-loop transform above, but would have to form one loop
> > and just manually implement the stack operations:
> > 
> > 
> > cnt = 0
> > stack[cnt++] = i;
> > while (1) {
> > i = stack[--cnt];
> > if (is_base(i)) {
> > f(); continue;
> > }
> > g(i);
> > stack[cnt++] = i / 2 + 1;
> > stack[cnt++] = i / 2;
> > }
> > 
> > 
> > OK, it's still a well-formed loop, we can still do useful
> > operations
> > on it like LICM or, if it's profitable, unrolling.
> > 
> > 
> > Now what about a more complex case, like in the FFT butterfly
> > algorithm:
> > 
> > 
> > void ex4(i) {
> > if (is_base(i)) {
> > f(i); return;
> > }
> > g(i);
> > ex4(i / 2);
> > ex4(i / 2 + 1);
> > h(i);
> > }
> > 
> > 
> > Here, we have to do work after the last recursive call. This
> > presents
> > a problem, because we don't have any of the caller's context
when
> > dealing with a node. Is a node N the first recursive call? or the
> > second? Call-stack recursion can save two things: the state of
> > variables and the return address which serves as another kind of
> > state. The obvious answer is to turn this into a sort of state
> > machine:
> > 
> > 
> > cnt = 0
> > stack[cnt++] = i;
> > stack[cnt++] = STATE_1;
> > while (1) {
> > state = stack[--cnt];
> > i = stack[--cnt];
> > 
> > 
> > if (state == STATE_FINISH) {
> > h(i);
> > continue;
> > }
> > 
> > 
> > if (is_base(i)) {
> > f(); continue;
> > }
> > 
> > 
> > g(i);
> > stack[cnt++] = i; // Push ourself again, in the final state this
> > time
> > stack[cnt++] = STATE_FINISH;
> > stack[cnt++] = i / 2 + 1;
> > stack[cnt++] = STATE_1;
> > stack[cnt++] = i / 2;
> > stack[cnt++] = STATE_1;
> > }
> 
> As we had briefly mentioned on IRC, one way of forming this
'stack',
> and its associated 'cnt' variable, is to use dynamic stack
> allocation. You can do this in general at the IR level (using
> allocas that just happen not to be in the entry block). If each
> stack block holds a pointer to the previous one (as a linked list),
> then you just need to maintain a pointer to the current block, and
> you don't need a separate 'cnt' variable (you might also keep a
> pointer to the next block for allocation reuse). Is that what you
> had in mind?
> 
> -Hal
> 
> > 
> > 
> > This solution can be generalised for if there is code in between
> > the
> > recursive calls (just add more states). It's getting more complex
> > in
> > terms of control flow though - the question is, is it worth it?
> > 
> > 
> > Unrolling this loop is probably not going to usefully create larger
> > basic blocks due to how complex the control flow is. I suppose we
> > have the opportunity, if we unroll, of CSE and jump threading
> > values
> > through equivalent blocks that may gain us time. However, we can
> > still do LICM which is probably a major improvement, and there are
> > no calls/returns to speculate.
> > 
> > 
> > This approach would have to be extended if the recursive calls
> > didn't
> > return void (for example it was a recursive reduction), but it's
> > not
> > impossible.
> > 
> > 
> > Sorry for the long post. What do people think?
> > 
> > 
> > Cheers,
> > 
> > 
> > James
> > 
> > 
> > [1]: http://link.springer.com/ chapter/10.1007/3-540-45574-4_ 3
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Rafael Auler

2015-Feb-18 19:40 UTC

head link

[LLVMdev] RFC: Recursive inlining

Sorry for jumping late into this thread, I'm quite late in my mail reading
after CGO. Here's my experience with this topic (I presented this on a
binary translation workshop of CGO):

I sort of stumbled upon this transformation by accident in the last year. I
do research with binary translation and LLVM -- I have a prototype that
reads MIPS binary code and mimics the behavior of a MIPS guest machine with
LLVM IR code, and then use an LLVM backend to emulate the guest code on
other host machines (x86 and ARM).

I guess you already see where this is going after all this state machine
discussion. One day I decided to experiment with MIPS function calls being
modelled in the IR as jumps to basic blocks instead of calls to LLVM
functions. In this way, the whole MIPS program would be translated into a
single LLVM function, removing any recursion, if present.

After this, two of my very simple benchmarks (Fibonacci and Ackermann, two
recursive ones) began to show better performance *after* being compiled to
MIPS and then retranslated to x86, which didn't make sense, because binary
translation always imposes some overhead that comes from emulating the
behavior of another machine. Upon further investigation, the culprit was
that the recursive functions were transformed into a loop in a way that is
similar to what you described: the stack you refer to is my MIPS guest
stack that I need to emulate on top of x86. Nothing special, I'm just
emulating a processor in software, so, upon hitting a return instruction, I
check the return address to decide to which basic block to jump to (whether
it's time to exit the loop or not), similar to your control.

The performance gain was so expressive that it was mitigating the binary
translator overhead and yet beating native x86 performance by 2x in the
first implementation (LLVM 3.3 running in an old Intel Core2quad). Now in
trunk and running on a recent processor, it is a much more modest speedup
of 20%. But I only observed this for simple benchmarks, such as Fibonacci
and Ackermann. If you didn't implement this transformation yet and you are
curious about its impact, I can test this framework with other simple
workloads, as long as it doesn't break my prototype :-) It's not ideal
though, since it has a considerable overhead from emulating a guest MIPS
platform.

However, when translating to ARM, I could not observe the same gains seen
on x86 because the way I performed the binary translation imposed a higher
overhead on ARM (Cortex A9) platforms than on x86 (Haswell) ones.

Here are the results for LLVM trunk of Jan 19 (my last rebase). I'm using
the full -O3 opt pipeline after I translate the code:

Fibonacci:
Native x86:  1.56s
Translated MIPS to x86:   1.26s  (-20.75%)

Native ARM:  6.42s
Translated MIPS to ARM: 6.49s  (overhead of 9%)

Ackermann:
Native x86:  0.94s
Translated MIPS to x86: 0.57s (-38.30%)

Regards,
Rafael Auler


On Fri, Feb 6, 2015 at 12:34 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
> > From: "James Molloy" <james at jamesmolloy.co.uk>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc
at google.com>,
> "pablo barrio" <pablo.barrio at arm.com>
> > Sent: Thursday, February 5, 2015 9:18:35 AM
> > Subject: Re: RFC: Recursive inlining
> >
> >
> > Hi Hal,
> >
> > > As we had briefly mentioned on IRC, one way of forming this
> > > 'stack', and its associated 'cnt' variable, is to
use dynamic
> > > stack allocation.
> >
> >
> >
> > I hadn't really reached a decision on the mechanics yet. However,
> > your suggestion while it can work within the current abilities of
> > the IR, has the disadvantage that it is using an extra stack slot
> > for the link pointer. I think if we are going to apply this
> > optimization, we must make sure we aren't going to blow the stack.
> >
> >
> > As users may already be just close enough to the max-stack boundary
> > that their applications don't crash, I think we should aim as much
> > as possible to keep the stack usage no higher than the recursive
> > equivalent. Holding the "state" is the equivalent of a
return
> > address, so adding a stack link pointer would be the equivalent of a
> > frame address I suppose... That could be acceptable?
> >
>
> I agree. We should aim not to increase the amount of stack space used.
> This will require some target knowledge (because dynamic stack allocation
> has some target-dependent overhead because of alignment and other ABI
> requirements).
>
>  -Hal
>
> >
> > As I say, I haven't thought about the deep mechanics just yet.
> >
> >
> > James
> >
> > On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < hfinkel at anl.gov
>
> > wrote:
> >
> >
> > ----- Original Message -----
> > > From: "James Molloy" < james at jamesmolloy.co.uk
>
> > > To: llvmdev at cs.uiuc.edu , "Chandler Carruth" <
chandlerc at google.com
> > > >, "Hal Finkel" < hfinkel at anl.gov >,
"pablo barrio"
> > > < pablo.barrio at arm.com >
> > > Sent: Thursday, February 5, 2015 4:36:16 AM
> > > Subject: RFC: Recursive inlining
> > >
> > >
> > > Hi,
> > >
> > >
> > > The other day on IRC myself, Chandler, Hal and Pablo discussed
> > > recursive inlining. I've pondered a bit since then, and
thought I'd
> > > get some background on the list with the aim of stimulating
> > > discussion.
> > >
> > >
> > > Our goal is to inline recursive functions. We have several
> > > workloads
> > > with recursive calls, and inlining them to a certain degree can
> > > improve performance by up to 30%.
> > >
> > >
> > > Inlining recursive functions is a slightly different problem to
> > > inlining non-recursive functions in that the inlining can never
> > > terminate, so you need to choose a factor (number of times) to
> > > inline. In this respect it is similar to loop unrolling, and
> > > Chandler suggested we should model it as such.
> > >
> > >
> > > We originally thought he meant by this that the mechanics of
> > > inlining
> > > would stay the same, but the heuristics would have to be changed.
> > > This is the approach taken by Rugina & Rinard [1], which is
almost
> > > the sum of the literature on this topic I could find, which
> > > surprised me. However, the actual suggestion was to remove the
> > > recursion entirely, and replace it with a loop.
> > >
> > >
> > > Before I get into the mechanics of this, I'd like to explain
what
> > > the
> > > possible optimization opportunities are for recursive functions
> > > turned into a loop.
> > >
> > >
> > > 1. Unrolling
> > > - This leads to bigger basic blocks (if they can be merged) and
> > > exposes more pipeline-level parallelism.
> > > - CSE between iterations.
> > > - Reduces backbranch overhead (generally negligible impact).
> > > 2. Commoning / LICM
> > > - Recursion doesn't have a vehicle for moving common code out
into
> > > a
> > > dominating block (compare with LICM for loops).
> > >
> > >
> > > OK, first example:
> > >
> > >
> > > void ex1(i) {
> > > if (i == 0) return;
> > > a[i] = i+1;
> > > ex1(i-1);
> > > }
> > >
> > >
> > > gets converted to:
> > >
> > >
> > > while (1) {
> > > if (i == 0) break;
> > > a[i] = i+1;
> > > i = i-1;
> > > }
> > >
> > >
> > > Now, the above is the simplest case and is pure tail recursion.
We
> > > can eliminate this currently with the TailRecursionElimination
> > > pass.
> > > The more complex case is where we have non-tail recusion and live
> > > variables.
> > >
> > >
> > > void ex2(i) {
> > > if (i == 0) return;
> > > j = a[i];
> > > ex2(i-1)
> > > b[j] = i;
> > > }
> > >
> > >
> > > This requires a stack being modelled and could be converted to:
> > >
> > >
> > > cnt = 0;
> > > while (1) {
> > > if (i == 0) break;
> > > j = a[i];
> > > stack[cnt++] = j;
> > > stack[cnt++] = i;
> > > ++cnt;
> > > --i;
> > > }
> > > while (cnt) {
> > > i = stack[--cnt];
> > > j = stack[--cnt];
> > > b[j] = i;
> > > }
> > >
> > >
> > > The above is as far as the discussion got on IRC. The following
are
> > > my half-baked thoughts.
> > >
> > >
> > > The above transform works because the access pattern to the stack
> > > is
> > > strictly linear. Pushes are followed by pops. But I don't
think
> > > this
> > > is the most interesting case for recursion in real programs. I
see
> > > (qualitatively, poorly) 3 reasons for recursion in code that
should
> > > go fast:
> > >
> > >
> > > 1. A fibonnacci function in a microbenchmark or compiler
shootout.
> > > I
> > > think we can safely ignore this usecase...
> > > 2. Divide and conquer algorithms. Quicksort, FFT butterflies,
etc.
> > > 3. Traversing a data structure, where the recursion is simply to
> > > get
> > > to the leaves which is where all the fun stuff happens.
> > >
> > >
> > > Points 2 and 3 both share a trait that they're
dividing/fanning out
> > > at each step:
> > >
> > >
> > > void ex3(i) {
> > > if (is_base(i)) {
> > > f();
> > > return;
> > > }
> > > g(i);
> > > ex3(i / 2);
> > > ex3(i / 2 + 1);
> > > }
> > >
> > >
> > > The access pattern of such a function is not linear. It is a
> > > pre-order walk of a (binary, in this case) tree. As such we
can't
> > > use the two-loop transform above, but would have to form one loop
> > > and just manually implement the stack operations:
> > >
> > >
> > > cnt = 0
> > > stack[cnt++] = i;
> > > while (1) {
> > > i = stack[--cnt];
> > > if (is_base(i)) {
> > > f(); continue;
> > > }
> > > g(i);
> > > stack[cnt++] = i / 2 + 1;
> > > stack[cnt++] = i / 2;
> > > }
> > >
> > >
> > > OK, it's still a well-formed loop, we can still do useful
> > > operations
> > > on it like LICM or, if it's profitable, unrolling.
> > >
> > >
> > > Now what about a more complex case, like in the FFT butterfly
> > > algorithm:
> > >
> > >
> > > void ex4(i) {
> > > if (is_base(i)) {
> > > f(i); return;
> > > }
> > > g(i);
> > > ex4(i / 2);
> > > ex4(i / 2 + 1);
> > > h(i);
> > > }
> > >
> > >
> > > Here, we have to do work after the last recursive call. This
> > > presents
> > > a problem, because we don't have any of the caller's
context when
> > > dealing with a node. Is a node N the first recursive call? or the
> > > second? Call-stack recursion can save two things: the state of
> > > variables and the return address which serves as another kind of
> > > state. The obvious answer is to turn this into a sort of state
> > > machine:
> > >
> > >
> > > cnt = 0
> > > stack[cnt++] = i;
> > > stack[cnt++] = STATE_1;
> > > while (1) {
> > > state = stack[--cnt];
> > > i = stack[--cnt];
> > >
> > >
> > > if (state == STATE_FINISH) {
> > > h(i);
> > > continue;
> > > }
> > >
> > >
> > > if (is_base(i)) {
> > > f(); continue;
> > > }
> > >
> > >
> > > g(i);
> > > stack[cnt++] = i; // Push ourself again, in the final state this
> > > time
> > > stack[cnt++] = STATE_FINISH;
> > > stack[cnt++] = i / 2 + 1;
> > > stack[cnt++] = STATE_1;
> > > stack[cnt++] = i / 2;
> > > stack[cnt++] = STATE_1;
> > > }
> >
> > As we had briefly mentioned on IRC, one way of forming this
'stack',
> > and its associated 'cnt' variable, is to use dynamic stack
> > allocation. You can do this in general at the IR level (using
> > allocas that just happen not to be in the entry block). If each
> > stack block holds a pointer to the previous one (as a linked list),
> > then you just need to maintain a pointer to the current block, and
> > you don't need a separate 'cnt' variable (you might also
keep a
> > pointer to the next block for allocation reuse). Is that what you
> > had in mind?
> >
> > -Hal
> >
> > >
> > >
> > > This solution can be generalised for if there is code in between
> > > the
> > > recursive calls (just add more states). It's getting more
complex
> > > in
> > > terms of control flow though - the question is, is it worth it?
> > >
> > >
> > > Unrolling this loop is probably not going to usefully create
larger
> > > basic blocks due to how complex the control flow is. I suppose we
> > > have the opportunity, if we unroll, of CSE and jump threading
> > > values
> > > through equivalent blocks that may gain us time. However, we can
> > > still do LICM which is probably a major improvement, and there
are
> > > no calls/returns to speculate.
> > >
> > >
> > > This approach would have to be extended if the recursive calls
> > > didn't
> > > return void (for example it was a recursive reduction), but
it's
> > > not
> > > impossible.
> > >
> > >
> > > Sorry for the long post. What do people think?
> > >
> > >
> > > Cheers,
> > >
> > >
> > > James
> > >
> > >
> > > [1]: http://link.springer.com/ chapter/10.1007/3-540-45574-4_ 3
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150218/06e33938/attachment.html>

Pablo Barrio

2015-Feb-18 20:57 UTC

head link

[LLVMdev] RFC: Recursive inlining

Hi,

Apologies for the very late response.

We have manually tried the idea with a very simple Fibonacci sequence code.
While being very very simple, the recursion cannot be handled by TRE. Because
there are two recursive callsites, it also needs to keep some sort of state
across iterations of the "while(stack not empty)" loop.

We get between 2.5 and 8x slowdowns depending on which processor, and the branch
mispredictions are huge (specially in x86: around 100 times higher). We are
replacing the function call mechanism by a stack + switch statement to mimic the
control flow for callsites. My understanding is that the branches in the switch
are hardly predictable: they depend on a value taken from the stack at each
iteration. Things could get harder as we increase the number of recursive
callsites in a function (we increase the size of the switch).

Also, virtually all processors beyond a certain level of sophistication
(let's say, meant to run an OS) feature a return-stack prediction mechanism.
This results in highly-efficient branch prediction for calls+returns. Since we
are transforming the "call jumps" into "switch jumps", we
are probably misusing that piece of hardware.

Of course, this is not to say that this is a dead end. I paste the code here
(original and "unrolled" version) so that you see the problem, and
maybe you can spot something that I missed. Put your favorite queue
implementation in there.

Cheers,
Pablo


unsigned fibonacci(unsigned n){

  if (n == 0)
    return 0;
  else if (n == 1)
    return 1;
  else
    return fibonacci(n - 1) + fibonacci(n - 2);
}

unsigned fibonacci_opt(unsigned n){

  // S0 = before 1st call, S1 = after 1st call, S2 = after 2nd call
  enum {S0, S1, S2};

  Stack s;
  int ret;
  init(&s, 256);
  push(&s, n);
  push(&s, S0);

  while (!empty(s)){

    /* print(&s); */

    int state = pop(&s);
    n = pop(&s);

    int a, b;

    switch(state){
    case S0:

      if (n <= 0)
        ret = 0;

      else if (n == 1)
        ret = 1;

      else{
        // On "fake return" we need to move to the end of the first
"recursive call".
        push(&s, n);
        push(&s, S1);

        // Go for the first "recursive call"
        push(&s, n - 1);
        push(&s, S0);
      }
      break;

    case S1:
      // On "fake return", we want to recover the return of the
previous call.
      push(&s, ret);
      push(&s, S2);

      // Go for the second "recursive call"
      push(&s, n - 2);
      push(&s, S0);
      break;

    case S2:
      ret += n;
    }
  }

  delete(&s);
  return ret;
}

-----Original Message-----
From: Hal Finkel [mailto:hfinkel at anl.gov]
Sent: 06 February 2015 02:35
To: James Molloy
Cc: llvmdev at cs.uiuc.edu; Chandler Carruth; Pablo Barrio
Subject: Re: RFC: Recursive inlining

----- Original Message -----> From: "James Molloy" <james at jamesmolloy.co.uk>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc at
google.com>,
> "pablo barrio" <pablo.barrio at arm.com>
> Sent: Thursday, February 5, 2015 9:18:35 AM
> Subject: Re: RFC: Recursive inlining
>
>
> Hi Hal,
>
> > As we had briefly mentioned on IRC, one way of forming this
'stack',
> > and its associated 'cnt' variable, is to use dynamic stack
> > allocation.
>
>
>
> I hadn't really reached a decision on the mechanics yet. However, your
> suggestion while it can work within the current abilities of the IR,
> has the disadvantage that it is using an extra stack slot for the link
> pointer. I think if we are going to apply this optimization, we must
> make sure we aren't going to blow the stack.
>
>
> As users may already be just close enough to the max-stack boundary
> that their applications don't crash, I think we should aim as much as
> possible to keep the stack usage no higher than the recursive
> equivalent. Holding the "state" is the equivalent of a return
address,
> so adding a stack link pointer would be the equivalent of a frame
> address I suppose... That could be acceptable?
>
I agree. We should aim not to increase the amount of stack space used. This will
require some target knowledge (because dynamic stack allocation has some
target-dependent overhead because of alignment and other ABI requirements).

 -Hal
>
> As I say, I haven't thought about the deep mechanics just yet.
>
>
> James
>
> On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < hfinkel at anl.gov >
> wrote:
>
>
> ----- Original Message -----
> > From: "James Molloy" < james at jamesmolloy.co.uk >
> > To: llvmdev at cs.uiuc.edu , "Chandler Carruth" <
chandlerc at google.com
> > >, "Hal Finkel" < hfinkel at anl.gov >, "pablo
barrio"
> > < pablo.barrio at arm.com >
> > Sent: Thursday, February 5, 2015 4:36:16 AM
> > Subject: RFC: Recursive inlining
> >
> >
> > Hi,
> >
> >
> > The other day on IRC myself, Chandler, Hal and Pablo discussed
> > recursive inlining. I've pondered a bit since then, and thought
I'd
> > get some background on the list with the aim of stimulating
> > discussion.
> >
> >
> > Our goal is to inline recursive functions. We have several workloads
> > with recursive calls, and inlining them to a certain degree can
> > improve performance by up to 30%.
> >
> >
> > Inlining recursive functions is a slightly different problem to
> > inlining non-recursive functions in that the inlining can never
> > terminate, so you need to choose a factor (number of times) to
> > inline. In this respect it is similar to loop unrolling, and
> > Chandler suggested we should model it as such.
> >
> >
> > We originally thought he meant by this that the mechanics of
> > inlining would stay the same, but the heuristics would have to be
> > changed.
> > This is the approach taken by Rugina & Rinard [1], which is almost
> > the sum of the literature on this topic I could find, which
> > surprised me. However, the actual suggestion was to remove the
> > recursion entirely, and replace it with a loop.
> >
> >
> > Before I get into the mechanics of this, I'd like to explain what
> > the possible optimization opportunities are for recursive functions
> > turned into a loop.
> >
> >
> > 1. Unrolling
> > - This leads to bigger basic blocks (if they can be merged) and
> > exposes more pipeline-level parallelism.
> > - CSE between iterations.
> > - Reduces backbranch overhead (generally negligible impact).
> > 2. Commoning / LICM
> > - Recursion doesn't have a vehicle for moving common code out into
a
> > dominating block (compare with LICM for loops).
> >
> >
> > OK, first example:
> >
> >
> > void ex1(i) {
> > if (i == 0) return;
> > a[i] = i+1;
> > ex1(i-1);
> > }
> >
> >
> > gets converted to:
> >
> >
> > while (1) {
> > if (i == 0) break;
> > a[i] = i+1;
> > i = i-1;
> > }
> >
> >
> > Now, the above is the simplest case and is pure tail recursion. We
> > can eliminate this currently with the TailRecursionElimination pass.
> > The more complex case is where we have non-tail recusion and live
> > variables.
> >
> >
> > void ex2(i) {
> > if (i == 0) return;
> > j = a[i];
> > ex2(i-1)
> > b[j] = i;
> > }
> >
> >
> > This requires a stack being modelled and could be converted to:
> >
> >
> > cnt = 0;
> > while (1) {
> > if (i == 0) break;
> > j = a[i];
> > stack[cnt++] = j;
> > stack[cnt++] = i;
> > ++cnt;
> > --i;
> > }
> > while (cnt) {
> > i = stack[--cnt];
> > j = stack[--cnt];
> > b[j] = i;
> > }
> >
> >
> > The above is as far as the discussion got on IRC. The following are
> > my half-baked thoughts.
> >
> >
> > The above transform works because the access pattern to the stack is
> > strictly linear. Pushes are followed by pops. But I don't think
this
> > is the most interesting case for recursion in real programs. I see
> > (qualitatively, poorly) 3 reasons for recursion in code that should
> > go fast:
> >
> >
> > 1. A fibonnacci function in a microbenchmark or compiler shootout.
> > I
> > think we can safely ignore this usecase...
> > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc.
> > 3. Traversing a data structure, where the recursion is simply to get
> > to the leaves which is where all the fun stuff happens.
> >
> >
> > Points 2 and 3 both share a trait that they're dividing/fanning
out
> > at each step:
> >
> >
> > void ex3(i) {
> > if (is_base(i)) {
> > f();
> > return;
> > }
> > g(i);
> > ex3(i / 2);
> > ex3(i / 2 + 1);
> > }
> >
> >
> > The access pattern of such a function is not linear. It is a
> > pre-order walk of a (binary, in this case) tree. As such we can't
> > use the two-loop transform above, but would have to form one loop
> > and just manually implement the stack operations:
> >
> >
> > cnt = 0
> > stack[cnt++] = i;
> > while (1) {
> > i = stack[--cnt];
> > if (is_base(i)) {
> > f(); continue;
> > }
> > g(i);
> > stack[cnt++] = i / 2 + 1;
> > stack[cnt++] = i / 2;
> > }
> >
> >
> > OK, it's still a well-formed loop, we can still do useful
operations
> > on it like LICM or, if it's profitable, unrolling.
> >
> >
> > Now what about a more complex case, like in the FFT butterfly
> > algorithm:
> >
> >
> > void ex4(i) {
> > if (is_base(i)) {
> > f(i); return;
> > }
> > g(i);
> > ex4(i / 2);
> > ex4(i / 2 + 1);
> > h(i);
> > }
> >
> >
> > Here, we have to do work after the last recursive call. This
> > presents a problem, because we don't have any of the caller's
> > context when dealing with a node. Is a node N the first recursive
> > call? or the second? Call-stack recursion can save two things: the
> > state of variables and the return address which serves as another
> > kind of state. The obvious answer is to turn this into a sort of
> > state
> > machine:
> >
> >
> > cnt = 0
> > stack[cnt++] = i;
> > stack[cnt++] = STATE_1;
> > while (1) {
> > state = stack[--cnt];
> > i = stack[--cnt];
> >
> >
> > if (state == STATE_FINISH) {
> > h(i);
> > continue;
> > }
> >
> >
> > if (is_base(i)) {
> > f(); continue;
> > }
> >
> >
> > g(i);
> > stack[cnt++] = i; // Push ourself again, in the final state this
> > time stack[cnt++] = STATE_FINISH; stack[cnt++] = i / 2 + 1;
> > stack[cnt++] = STATE_1; stack[cnt++] = i / 2; stack[cnt++] > >
STATE_1; }
>
> As we had briefly mentioned on IRC, one way of forming this
'stack',
> and its associated 'cnt' variable, is to use dynamic stack
allocation.
> You can do this in general at the IR level (using allocas that just
> happen not to be in the entry block). If each stack block holds a
> pointer to the previous one (as a linked list), then you just need to
> maintain a pointer to the current block, and you don't need a separate
> 'cnt' variable (you might also keep a pointer to the next block for
> allocation reuse). Is that what you had in mind?
>
> -Hal
>
> >
> >
> > This solution can be generalised for if there is code in between the
> > recursive calls (just add more states). It's getting more complex
in
> > terms of control flow though - the question is, is it worth it?
> >
> >
> > Unrolling this loop is probably not going to usefully create larger
> > basic blocks due to how complex the control flow is. I suppose we
> > have the opportunity, if we unroll, of CSE and jump threading values
> > through equivalent blocks that may gain us time. However, we can
> > still do LICM which is probably a major improvement, and there are
> > no calls/returns to speculate.
> >
> >
> > This approach would have to be extended if the recursive calls
> > didn't return void (for example it was a recursive reduction), but
> > it's not impossible.
> >
> >
> > Sorry for the long post. What do people think?
> >
> >
> > Cheers,
> >
> >
> > James
> >
> >
> > [1]: http://link.springer.com/ chapter/10.1007/3-540-45574-4_ 3
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


-- IMPORTANT NOTICE: The contents of this email and any attachments are
confidential and may also be privileged. If you are not the intended recipient,
please notify the sender immediately and do not disclose the contents to any
other person, use it for any purpose, or store or copy the information in any
medium.  Thank you.

ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered
in England & Wales, Company No:  2557590
ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
Registered in England & Wales, Company No:  2548782

llvm dev - Feb 2015 - [LLVMdev] RFC: Recursive inlining

[LLVMdev] RFC: Recursive inlining

[LLVMdev] RFC: Recursive inlining

[LLVMdev] RFC: Recursive inlining

[LLVMdev] RFC: Recursive inlining

[LLVMdev] RFC: Recursive inlining

[LLVMdev] RFC: Recursive inlining