Hi, Apologies for the very late response. We have manually tried the idea with a very simple Fibonacci sequence code. While being very very simple, the recursion cannot be handled by TRE. Because there are two recursive callsites, it also needs to keep some sort of state across iterations of the "while(stack not empty)" loop. We get between 2.5 and 8x slowdowns depending on which processor, and the branch mispredictions are huge (specially in x86: around 100 times higher). We are replacing the function call mechanism by a stack + switch statement to mimic the control flow for callsites. My understanding is that the branches in the switch are hardly predictable: they depend on a value taken from the stack at each iteration. Things could get harder as we increase the number of recursive callsites in a function (we increase the size of the switch). Also, virtually all processors beyond a certain level of sophistication (let's say, meant to run an OS) feature a return-stack prediction mechanism. This results in highly-efficient branch prediction for calls+returns. Since we are transforming the "call jumps" into "switch jumps", we are probably misusing that piece of hardware. Of course, this is not to say that this is a dead end. I paste the code here (original and "unrolled" version) so that you see the problem, and maybe you can spot something that I missed. Put your favorite queue implementation in there. Cheers, Pablo unsigned fibonacci(unsigned n){ if (n == 0) return 0; else if (n == 1) return 1; else return fibonacci(n - 1) + fibonacci(n - 2); } unsigned fibonacci_opt(unsigned n){ // S0 = before 1st call, S1 = after 1st call, S2 = after 2nd call enum {S0, S1, S2}; Stack s; int ret; init(&s, 256); push(&s, n); push(&s, S0); while (!empty(s)){ /* print(&s); */ int state = pop(&s); n = pop(&s); int a, b; switch(state){ case S0: if (n <= 0) ret = 0; else if (n == 1) ret = 1; else{ // On "fake return" we need to move to the end of the first "recursive call". push(&s, n); push(&s, S1); // Go for the first "recursive call" push(&s, n - 1); push(&s, S0); } break; case S1: // On "fake return", we want to recover the return of the previous call. push(&s, ret); push(&s, S2); // Go for the second "recursive call" push(&s, n - 2); push(&s, S0); break; case S2: ret += n; } } delete(&s); return ret; } -----Original Message----- From: Hal Finkel [mailto:hfinkel at anl.gov] Sent: 06 February 2015 02:35 To: James Molloy Cc: llvmdev at cs.uiuc.edu; Chandler Carruth; Pablo Barrio Subject: Re: RFC: Recursive inlining ----- Original Message -----> From: "James Molloy" <james at jamesmolloy.co.uk> > To: "Hal Finkel" <hfinkel at anl.gov> > Cc: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc at google.com>, > "pablo barrio" <pablo.barrio at arm.com> > Sent: Thursday, February 5, 2015 9:18:35 AM > Subject: Re: RFC: Recursive inlining > > > Hi Hal, > > > As we had briefly mentioned on IRC, one way of forming this 'stack', > > and its associated 'cnt' variable, is to use dynamic stack > > allocation. > > > > I hadn't really reached a decision on the mechanics yet. However, your > suggestion while it can work within the current abilities of the IR, > has the disadvantage that it is using an extra stack slot for the link > pointer. I think if we are going to apply this optimization, we must > make sure we aren't going to blow the stack. > > > As users may already be just close enough to the max-stack boundary > that their applications don't crash, I think we should aim as much as > possible to keep the stack usage no higher than the recursive > equivalent. Holding the "state" is the equivalent of a return address, > so adding a stack link pointer would be the equivalent of a frame > address I suppose... That could be acceptable? >I agree. We should aim not to increase the amount of stack space used. This will require some target knowledge (because dynamic stack allocation has some target-dependent overhead because of alignment and other ABI requirements). -Hal> > As I say, I haven't thought about the deep mechanics just yet. > > > James > > On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < hfinkel at anl.gov > > wrote: > > > ----- Original Message ----- > > From: "James Molloy" < james at jamesmolloy.co.uk > > > To: llvmdev at cs.uiuc.edu , "Chandler Carruth" < chandlerc at google.com > > >, "Hal Finkel" < hfinkel at anl.gov >, "pablo barrio" > > < pablo.barrio at arm.com > > > Sent: Thursday, February 5, 2015 4:36:16 AM > > Subject: RFC: Recursive inlining > > > > > > Hi, > > > > > > The other day on IRC myself, Chandler, Hal and Pablo discussed > > recursive inlining. I've pondered a bit since then, and thought I'd > > get some background on the list with the aim of stimulating > > discussion. > > > > > > Our goal is to inline recursive functions. We have several workloads > > with recursive calls, and inlining them to a certain degree can > > improve performance by up to 30%. > > > > > > Inlining recursive functions is a slightly different problem to > > inlining non-recursive functions in that the inlining can never > > terminate, so you need to choose a factor (number of times) to > > inline. In this respect it is similar to loop unrolling, and > > Chandler suggested we should model it as such. > > > > > > We originally thought he meant by this that the mechanics of > > inlining would stay the same, but the heuristics would have to be > > changed. > > This is the approach taken by Rugina & Rinard [1], which is almost > > the sum of the literature on this topic I could find, which > > surprised me. However, the actual suggestion was to remove the > > recursion entirely, and replace it with a loop. > > > > > > Before I get into the mechanics of this, I'd like to explain what > > the possible optimization opportunities are for recursive functions > > turned into a loop. > > > > > > 1. Unrolling > > - This leads to bigger basic blocks (if they can be merged) and > > exposes more pipeline-level parallelism. > > - CSE between iterations. > > - Reduces backbranch overhead (generally negligible impact). > > 2. Commoning / LICM > > - Recursion doesn't have a vehicle for moving common code out into a > > dominating block (compare with LICM for loops). > > > > > > OK, first example: > > > > > > void ex1(i) { > > if (i == 0) return; > > a[i] = i+1; > > ex1(i-1); > > } > > > > > > gets converted to: > > > > > > while (1) { > > if (i == 0) break; > > a[i] = i+1; > > i = i-1; > > } > > > > > > Now, the above is the simplest case and is pure tail recursion. We > > can eliminate this currently with the TailRecursionElimination pass. > > The more complex case is where we have non-tail recusion and live > > variables. > > > > > > void ex2(i) { > > if (i == 0) return; > > j = a[i]; > > ex2(i-1) > > b[j] = i; > > } > > > > > > This requires a stack being modelled and could be converted to: > > > > > > cnt = 0; > > while (1) { > > if (i == 0) break; > > j = a[i]; > > stack[cnt++] = j; > > stack[cnt++] = i; > > ++cnt; > > --i; > > } > > while (cnt) { > > i = stack[--cnt]; > > j = stack[--cnt]; > > b[j] = i; > > } > > > > > > The above is as far as the discussion got on IRC. The following are > > my half-baked thoughts. > > > > > > The above transform works because the access pattern to the stack is > > strictly linear. Pushes are followed by pops. But I don't think this > > is the most interesting case for recursion in real programs. I see > > (qualitatively, poorly) 3 reasons for recursion in code that should > > go fast: > > > > > > 1. A fibonnacci function in a microbenchmark or compiler shootout. > > I > > think we can safely ignore this usecase... > > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc. > > 3. Traversing a data structure, where the recursion is simply to get > > to the leaves which is where all the fun stuff happens. > > > > > > Points 2 and 3 both share a trait that they're dividing/fanning out > > at each step: > > > > > > void ex3(i) { > > if (is_base(i)) { > > f(); > > return; > > } > > g(i); > > ex3(i / 2); > > ex3(i / 2 + 1); > > } > > > > > > The access pattern of such a function is not linear. It is a > > pre-order walk of a (binary, in this case) tree. As such we can't > > use the two-loop transform above, but would have to form one loop > > and just manually implement the stack operations: > > > > > > cnt = 0 > > stack[cnt++] = i; > > while (1) { > > i = stack[--cnt]; > > if (is_base(i)) { > > f(); continue; > > } > > g(i); > > stack[cnt++] = i / 2 + 1; > > stack[cnt++] = i / 2; > > } > > > > > > OK, it's still a well-formed loop, we can still do useful operations > > on it like LICM or, if it's profitable, unrolling. > > > > > > Now what about a more complex case, like in the FFT butterfly > > algorithm: > > > > > > void ex4(i) { > > if (is_base(i)) { > > f(i); return; > > } > > g(i); > > ex4(i / 2); > > ex4(i / 2 + 1); > > h(i); > > } > > > > > > Here, we have to do work after the last recursive call. This > > presents a problem, because we don't have any of the caller's > > context when dealing with a node. Is a node N the first recursive > > call? or the second? Call-stack recursion can save two things: the > > state of variables and the return address which serves as another > > kind of state. The obvious answer is to turn this into a sort of > > state > > machine: > > > > > > cnt = 0 > > stack[cnt++] = i; > > stack[cnt++] = STATE_1; > > while (1) { > > state = stack[--cnt]; > > i = stack[--cnt]; > > > > > > if (state == STATE_FINISH) { > > h(i); > > continue; > > } > > > > > > if (is_base(i)) { > > f(); continue; > > } > > > > > > g(i); > > stack[cnt++] = i; // Push ourself again, in the final state this > > time stack[cnt++] = STATE_FINISH; stack[cnt++] = i / 2 + 1; > > stack[cnt++] = STATE_1; stack[cnt++] = i / 2; stack[cnt++] > > STATE_1; } > > As we had briefly mentioned on IRC, one way of forming this 'stack', > and its associated 'cnt' variable, is to use dynamic stack allocation. > You can do this in general at the IR level (using allocas that just > happen not to be in the entry block). If each stack block holds a > pointer to the previous one (as a linked list), then you just need to > maintain a pointer to the current block, and you don't need a separate > 'cnt' variable (you might also keep a pointer to the next block for > allocation reuse). Is that what you had in mind? > > -Hal > > > > > > > This solution can be generalised for if there is code in between the > > recursive calls (just add more states). It's getting more complex in > > terms of control flow though - the question is, is it worth it? > > > > > > Unrolling this loop is probably not going to usefully create larger > > basic blocks due to how complex the control flow is. I suppose we > > have the opportunity, if we unroll, of CSE and jump threading values > > through equivalent blocks that may gain us time. However, we can > > still do LICM which is probably a major improvement, and there are > > no calls/returns to speculate. > > > > > > This approach would have to be extended if the recursive calls > > didn't return void (for example it was a recursive reduction), but > > it's not impossible. > > > > > > Sorry for the long post. What do people think? > > > > > > Cheers, > > > > > > James > > > > > > [1]: http://link.springer.com/ chapter/10.1007/3-540-45574-4_ 3 > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590 ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782
On Wed, Feb 18, 2015 at 12:57 PM, Pablo Barrio <Pablo.Barrio at arm.com> wrote:> Also, virtually all processors beyond a certain level of sophistication > (let's say, meant to run an OS) feature a return-stack prediction > mechanism. This results in highly-efficient branch prediction for > calls+returns. Since we are transforming the "call jumps" into "switch > jumps", we are probably misusing that piece of hardware.Yes, it will be critical to use a branching pattern that similarly fits modern predictors. I don't think this will be hard though. Again, I'm continuing to work on a prototype actual pass. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150218/178c891e/attachment.html>
I'm not completely sure, but you are probably getting a slowdown here because LLVM transforms the native Fibonacci with a tail call optimization (not elimination, but it just rewrites the recursion) that greatly improves Fibonacci, and it is unable to apply this transformation to your explicit- stack implementation. To see this, compile your regular fibonacci and checkout the x86 assembly to see how LLVM transforms it. On Wed, Feb 18, 2015 at 6:57 PM, Pablo Barrio <Pablo.Barrio at arm.com> wrote:> Hi, > > Apologies for the very late response. > > We have manually tried the idea with a very simple Fibonacci sequence > code. While being very very simple, the recursion cannot be handled by TRE. > Because there are two recursive callsites, it also needs to keep some sort > of state across iterations of the "while(stack not empty)" loop. > > We get between 2.5 and 8x slowdowns depending on which processor, and the > branch mispredictions are huge (specially in x86: around 100 times higher). > We are replacing the function call mechanism by a stack + switch statement > to mimic the control flow for callsites. My understanding is that the > branches in the switch are hardly predictable: they depend on a value taken > from the stack at each iteration. Things could get harder as we increase > the number of recursive callsites in a function (we increase the size of > the switch). > > Also, virtually all processors beyond a certain level of sophistication > (let's say, meant to run an OS) feature a return-stack prediction > mechanism. This results in highly-efficient branch prediction for > calls+returns. Since we are transforming the "call jumps" into "switch > jumps", we are probably misusing that piece of hardware. > > Of course, this is not to say that this is a dead end. I paste the code > here (original and "unrolled" version) so that you see the problem, and > maybe you can spot something that I missed. Put your favorite queue > implementation in there. > > Cheers, > Pablo > > > unsigned fibonacci(unsigned n){ > > if (n == 0) > return 0; > else if (n == 1) > return 1; > else > return fibonacci(n - 1) + fibonacci(n - 2); > } > > unsigned fibonacci_opt(unsigned n){ > > // S0 = before 1st call, S1 = after 1st call, S2 = after 2nd call > enum {S0, S1, S2}; > > Stack s; > int ret; > init(&s, 256); > push(&s, n); > push(&s, S0); > > while (!empty(s)){ > > /* print(&s); */ > > int state = pop(&s); > n = pop(&s); > > int a, b; > > switch(state){ > case S0: > > if (n <= 0) > ret = 0; > > else if (n == 1) > ret = 1; > > else{ > // On "fake return" we need to move to the end of the first > "recursive call". > push(&s, n); > push(&s, S1); > > // Go for the first "recursive call" > push(&s, n - 1); > push(&s, S0); > } > break; > > case S1: > // On "fake return", we want to recover the return of the previous > call. > push(&s, ret); > push(&s, S2); > > // Go for the second "recursive call" > push(&s, n - 2); > push(&s, S0); > break; > > case S2: > ret += n; > } > } > > delete(&s); > return ret; > } > > -----Original Message----- > From: Hal Finkel [mailto:hfinkel at anl.gov] > Sent: 06 February 2015 02:35 > To: James Molloy > Cc: llvmdev at cs.uiuc.edu; Chandler Carruth; Pablo Barrio > Subject: Re: RFC: Recursive inlining > > ----- Original Message ----- > > From: "James Molloy" <james at jamesmolloy.co.uk> > > To: "Hal Finkel" <hfinkel at anl.gov> > > Cc: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc at google.com>, > > "pablo barrio" <pablo.barrio at arm.com> > > Sent: Thursday, February 5, 2015 9:18:35 AM > > Subject: Re: RFC: Recursive inlining > > > > > > Hi Hal, > > > > > As we had briefly mentioned on IRC, one way of forming this 'stack', > > > and its associated 'cnt' variable, is to use dynamic stack > > > allocation. > > > > > > > > I hadn't really reached a decision on the mechanics yet. However, your > > suggestion while it can work within the current abilities of the IR, > > has the disadvantage that it is using an extra stack slot for the link > > pointer. I think if we are going to apply this optimization, we must > > make sure we aren't going to blow the stack. > > > > > > As users may already be just close enough to the max-stack boundary > > that their applications don't crash, I think we should aim as much as > > possible to keep the stack usage no higher than the recursive > > equivalent. Holding the "state" is the equivalent of a return address, > > so adding a stack link pointer would be the equivalent of a frame > > address I suppose... That could be acceptable? > > > > I agree. We should aim not to increase the amount of stack space used. > This will require some target knowledge (because dynamic stack allocation > has some target-dependent overhead because of alignment and other ABI > requirements). > > -Hal > > > > > As I say, I haven't thought about the deep mechanics just yet. > > > > > > James > > > > On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < hfinkel at anl.gov > > > wrote: > > > > > > ----- Original Message ----- > > > From: "James Molloy" < james at jamesmolloy.co.uk > > > > To: llvmdev at cs.uiuc.edu , "Chandler Carruth" < chandlerc at google.com > > > >, "Hal Finkel" < hfinkel at anl.gov >, "pablo barrio" > > > < pablo.barrio at arm.com > > > > Sent: Thursday, February 5, 2015 4:36:16 AM > > > Subject: RFC: Recursive inlining > > > > > > > > > Hi, > > > > > > > > > The other day on IRC myself, Chandler, Hal and Pablo discussed > > > recursive inlining. I've pondered a bit since then, and thought I'd > > > get some background on the list with the aim of stimulating > > > discussion. > > > > > > > > > Our goal is to inline recursive functions. We have several workloads > > > with recursive calls, and inlining them to a certain degree can > > > improve performance by up to 30%. > > > > > > > > > Inlining recursive functions is a slightly different problem to > > > inlining non-recursive functions in that the inlining can never > > > terminate, so you need to choose a factor (number of times) to > > > inline. In this respect it is similar to loop unrolling, and > > > Chandler suggested we should model it as such. > > > > > > > > > We originally thought he meant by this that the mechanics of > > > inlining would stay the same, but the heuristics would have to be > > > changed. > > > This is the approach taken by Rugina & Rinard [1], which is almost > > > the sum of the literature on this topic I could find, which > > > surprised me. However, the actual suggestion was to remove the > > > recursion entirely, and replace it with a loop. > > > > > > > > > Before I get into the mechanics of this, I'd like to explain what > > > the possible optimization opportunities are for recursive functions > > > turned into a loop. > > > > > > > > > 1. Unrolling > > > - This leads to bigger basic blocks (if they can be merged) and > > > exposes more pipeline-level parallelism. > > > - CSE between iterations. > > > - Reduces backbranch overhead (generally negligible impact). > > > 2. Commoning / LICM > > > - Recursion doesn't have a vehicle for moving common code out into a > > > dominating block (compare with LICM for loops). > > > > > > > > > OK, first example: > > > > > > > > > void ex1(i) { > > > if (i == 0) return; > > > a[i] = i+1; > > > ex1(i-1); > > > } > > > > > > > > > gets converted to: > > > > > > > > > while (1) { > > > if (i == 0) break; > > > a[i] = i+1; > > > i = i-1; > > > } > > > > > > > > > Now, the above is the simplest case and is pure tail recursion. We > > > can eliminate this currently with the TailRecursionElimination pass. > > > The more complex case is where we have non-tail recusion and live > > > variables. > > > > > > > > > void ex2(i) { > > > if (i == 0) return; > > > j = a[i]; > > > ex2(i-1) > > > b[j] = i; > > > } > > > > > > > > > This requires a stack being modelled and could be converted to: > > > > > > > > > cnt = 0; > > > while (1) { > > > if (i == 0) break; > > > j = a[i]; > > > stack[cnt++] = j; > > > stack[cnt++] = i; > > > ++cnt; > > > --i; > > > } > > > while (cnt) { > > > i = stack[--cnt]; > > > j = stack[--cnt]; > > > b[j] = i; > > > } > > > > > > > > > The above is as far as the discussion got on IRC. The following are > > > my half-baked thoughts. > > > > > > > > > The above transform works because the access pattern to the stack is > > > strictly linear. Pushes are followed by pops. But I don't think this > > > is the most interesting case for recursion in real programs. I see > > > (qualitatively, poorly) 3 reasons for recursion in code that should > > > go fast: > > > > > > > > > 1. A fibonnacci function in a microbenchmark or compiler shootout. > > > I > > > think we can safely ignore this usecase... > > > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc. > > > 3. Traversing a data structure, where the recursion is simply to get > > > to the leaves which is where all the fun stuff happens. > > > > > > > > > Points 2 and 3 both share a trait that they're dividing/fanning out > > > at each step: > > > > > > > > > void ex3(i) { > > > if (is_base(i)) { > > > f(); > > > return; > > > } > > > g(i); > > > ex3(i / 2); > > > ex3(i / 2 + 1); > > > } > > > > > > > > > The access pattern of such a function is not linear. It is a > > > pre-order walk of a (binary, in this case) tree. As such we can't > > > use the two-loop transform above, but would have to form one loop > > > and just manually implement the stack operations: > > > > > > > > > cnt = 0 > > > stack[cnt++] = i; > > > while (1) { > > > i = stack[--cnt]; > > > if (is_base(i)) { > > > f(); continue; > > > } > > > g(i); > > > stack[cnt++] = i / 2 + 1; > > > stack[cnt++] = i / 2; > > > } > > > > > > > > > OK, it's still a well-formed loop, we can still do useful operations > > > on it like LICM or, if it's profitable, unrolling. > > > > > > > > > Now what about a more complex case, like in the FFT butterfly > > > algorithm: > > > > > > > > > void ex4(i) { > > > if (is_base(i)) { > > > f(i); return; > > > } > > > g(i); > > > ex4(i / 2); > > > ex4(i / 2 + 1); > > > h(i); > > > } > > > > > > > > > Here, we have to do work after the last recursive call. This > > > presents a problem, because we don't have any of the caller's > > > context when dealing with a node. Is a node N the first recursive > > > call? or the second? Call-stack recursion can save two things: the > > > state of variables and the return address which serves as another > > > kind of state. The obvious answer is to turn this into a sort of > > > state > > > machine: > > > > > > > > > cnt = 0 > > > stack[cnt++] = i; > > > stack[cnt++] = STATE_1; > > > while (1) { > > > state = stack[--cnt]; > > > i = stack[--cnt]; > > > > > > > > > if (state == STATE_FINISH) { > > > h(i); > > > continue; > > > } > > > > > > > > > if (is_base(i)) { > > > f(); continue; > > > } > > > > > > > > > g(i); > > > stack[cnt++] = i; // Push ourself again, in the final state this > > > time stack[cnt++] = STATE_FINISH; stack[cnt++] = i / 2 + 1; > > > stack[cnt++] = STATE_1; stack[cnt++] = i / 2; stack[cnt++] > > > STATE_1; } > > > > As we had briefly mentioned on IRC, one way of forming this 'stack', > > and its associated 'cnt' variable, is to use dynamic stack allocation. > > You can do this in general at the IR level (using allocas that just > > happen not to be in the entry block). If each stack block holds a > > pointer to the previous one (as a linked list), then you just need to > > maintain a pointer to the current block, and you don't need a separate > > 'cnt' variable (you might also keep a pointer to the next block for > > allocation reuse). Is that what you had in mind? > > > > -Hal > > > > > > > > > > > This solution can be generalised for if there is code in between the > > > recursive calls (just add more states). It's getting more complex in > > > terms of control flow though - the question is, is it worth it? > > > > > > > > > Unrolling this loop is probably not going to usefully create larger > > > basic blocks due to how complex the control flow is. I suppose we > > > have the opportunity, if we unroll, of CSE and jump threading values > > > through equivalent blocks that may gain us time. However, we can > > > still do LICM which is probably a major improvement, and there are > > > no calls/returns to speculate. > > > > > > > > > This approach would have to be extended if the recursive calls > > > didn't return void (for example it was a recursive reduction), but > > > it's not impossible. > > > > > > > > > Sorry for the long post. What do people think? > > > > > > > > > Cheers, > > > > > > > > > James > > > > > > > > > [1]: http://link.springer.com/ chapter/10.1007/3-540-45574-4_ 3 > > > > -- > > Hal Finkel > > Assistant Computational Scientist > > Leadership Computing Facility > > Argonne National Laboratory > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory > > > -- IMPORTANT NOTICE: The contents of this email and any attachments are > confidential and may also be privileged. If you are not the intended > recipient, please notify the sender immediately and do not disclose the > contents to any other person, use it for any purpose, or store or copy the > information in any medium. Thank you. > > ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, > Registered in England & Wales, Company No: 2557590 > ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, > Registered in England & Wales, Company No: 2548782 > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150218/8dbae05e/attachment.html>
The manual transformation of this form seems to work well enough for me; unsigned fibonacci(unsigned n){ if (n <= 1) return n; return fibonacci(n - 1) + fibonacci(n - 2); } time fib(40) real 0m1.100s user 0m1.096s sys 0m0.004s after; unsigned fibonacci(unsigned argn){ unsigned stack[1024]; unsigned pos=0; unsigned n=argn; unsigned ret; unsigned tmp1; stack[pos++]=0; start: if (n <= 1){ ret=n; goto pop; } stack[pos++]=n; // push n as it's live stack[pos++]=1; // push state n= n - 1; // set call argument goto start; pop: // pop state and jump switch(stack[--pos]){ case 0: return ret; case 1: n=stack[--pos]; // pop n stack[pos++]=ret; // push ret as it's live stack[pos++]=2; // push state n= n - 2; // set call argument goto start; case 2: tmp1 = stack[--pos]; // pop temporary return value ret = tmp1 + ret; goto pop; } } time fib(40) real 0m0.772s user 0m0.772s sys 0m0.003s But with your 3 return example the stack based code is ending up slower after optimisations; ... if (n == 0){ ret=0; goto pop; } if (n == 1){ ret=1; goto pop; } ... time fib(40) real 0m1.233s user 0m1.233s sys 0m0.000s On Thu, Feb 19, 2015 at 7:27 AM, Pablo Barrio <Pablo.Barrio at arm.com> wrote:> Hi, > > Apologies for the very late response. > > We have manually tried the idea with a very simple Fibonacci sequence > code. While being very very simple, the recursion cannot be handled by TRE. > Because there are two recursive callsites, it also needs to keep some sort > of state across iterations of the "while(stack not empty)" loop. > > We get between 2.5 and 8x slowdowns depending on which processor, and the > branch mispredictions are huge (specially in x86: around 100 times higher). > We are replacing the function call mechanism by a stack + switch statement > to mimic the control flow for callsites. My understanding is that the > branches in the switch are hardly predictable: they depend on a value taken > from the stack at each iteration. Things could get harder as we increase > the number of recursive callsites in a function (we increase the size of > the switch). > > Also, virtually all processors beyond a certain level of sophistication > (let's say, meant to run an OS) feature a return-stack prediction > mechanism. This results in highly-efficient branch prediction for > calls+returns. Since we are transforming the "call jumps" into "switch > jumps", we are probably misusing that piece of hardware. > > Of course, this is not to say that this is a dead end. I paste the code > here (original and "unrolled" version) so that you see the problem, and > maybe you can spot something that I missed. Put your favorite queue > implementation in there. > > Cheers, > Pablo > > > unsigned fibonacci(unsigned n){ > > if (n == 0) > return 0; > else if (n == 1) > return 1; > else > return fibonacci(n - 1) + fibonacci(n - 2); > } > > unsigned fibonacci_opt(unsigned n){ > > // S0 = before 1st call, S1 = after 1st call, S2 = after 2nd call > enum {S0, S1, S2}; > > Stack s; > int ret; > init(&s, 256); > push(&s, n); > push(&s, S0); > > while (!empty(s)){ > > /* print(&s); */ > > int state = pop(&s); > n = pop(&s); > > int a, b; > > switch(state){ > case S0: > > if (n <= 0) > ret = 0; > > else if (n == 1) > ret = 1; > > else{ > // On "fake return" we need to move to the end of the first > "recursive call". > push(&s, n); > push(&s, S1); > > // Go for the first "recursive call" > push(&s, n - 1); > push(&s, S0); > } > break; > > case S1: > // On "fake return", we want to recover the return of the previous > call. > push(&s, ret); > push(&s, S2); > > // Go for the second "recursive call" > push(&s, n - 2); > push(&s, S0); > break; > > case S2: > ret += n; > } > } > > delete(&s); > return ret; > } > > -----Original Message----- > From: Hal Finkel [mailto:hfinkel at anl.gov] > Sent: 06 February 2015 02:35 > To: James Molloy > Cc: llvmdev at cs.uiuc.edu; Chandler Carruth; Pablo Barrio > Subject: Re: RFC: Recursive inlining > > ----- Original Message ----- > > From: "James Molloy" <james at jamesmolloy.co.uk> > > To: "Hal Finkel" <hfinkel at anl.gov> > > Cc: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc at google.com>, > > "pablo barrio" <pablo.barrio at arm.com> > > Sent: Thursday, February 5, 2015 9:18:35 AM > > Subject: Re: RFC: Recursive inlining > > > > > > Hi Hal, > > > > > As we had briefly mentioned on IRC, one way of forming this 'stack', > > > and its associated 'cnt' variable, is to use dynamic stack > > > allocation. > > > > > > > > I hadn't really reached a decision on the mechanics yet. However, your > > suggestion while it can work within the current abilities of the IR, > > has the disadvantage that it is using an extra stack slot for the link > > pointer. I think if we are going to apply this optimization, we must > > make sure we aren't going to blow the stack. > > > > > > As users may already be just close enough to the max-stack boundary > > that their applications don't crash, I think we should aim as much as > > possible to keep the stack usage no higher than the recursive > > equivalent. Holding the "state" is the equivalent of a return address, > > so adding a stack link pointer would be the equivalent of a frame > > address I suppose... That could be acceptable? > > > > I agree. We should aim not to increase the amount of stack space used. > This will require some target knowledge (because dynamic stack allocation > has some target-dependent overhead because of alignment and other ABI > requirements). > > -Hal > > > > > As I say, I haven't thought about the deep mechanics just yet. > > > > > > James > > > > On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < hfinkel at anl.gov > > > wrote: > > > > > > ----- Original Message ----- > > > From: "James Molloy" < james at jamesmolloy.co.uk > > > > To: llvmdev at cs.uiuc.edu , "Chandler Carruth" < chandlerc at google.com > > > >, "Hal Finkel" < hfinkel at anl.gov >, "pablo barrio" > > > < pablo.barrio at arm.com > > > > Sent: Thursday, February 5, 2015 4:36:16 AM > > > Subject: RFC: Recursive inlining > > > > > > > > > Hi, > > > > > > > > > The other day on IRC myself, Chandler, Hal and Pablo discussed > > > recursive inlining. I've pondered a bit since then, and thought I'd > > > get some background on the list with the aim of stimulating > > > discussion. > > > > > > > > > Our goal is to inline recursive functions. We have several workloads > > > with recursive calls, and inlining them to a certain degree can > > > improve performance by up to 30%. > > > > > > > > > Inlining recursive functions is a slightly different problem to > > > inlining non-recursive functions in that the inlining can never > > > terminate, so you need to choose a factor (number of times) to > > > inline. In this respect it is similar to loop unrolling, and > > > Chandler suggested we should model it as such. > > > > > > > > > We originally thought he meant by this that the mechanics of > > > inlining would stay the same, but the heuristics would have to be > > > changed. > > > This is the approach taken by Rugina & Rinard [1], which is almost > > > the sum of the literature on this topic I could find, which > > > surprised me. However, the actual suggestion was to remove the > > > recursion entirely, and replace it with a loop. > > > > > > > > > Before I get into the mechanics of this, I'd like to explain what > > > the possible optimization opportunities are for recursive functions > > > turned into a loop. > > > > > > > > > 1. Unrolling > > > - This leads to bigger basic blocks (if they can be merged) and > > > exposes more pipeline-level parallelism. > > > - CSE between iterations. > > > - Reduces backbranch overhead (generally negligible impact). > > > 2. Commoning / LICM > > > - Recursion doesn't have a vehicle for moving common code out into a > > > dominating block (compare with LICM for loops). > > > > > > > > > OK, first example: > > > > > > > > > void ex1(i) { > > > if (i == 0) return; > > > a[i] = i+1; > > > ex1(i-1); > > > } > > > > > > > > > gets converted to: > > > > > > > > > while (1) { > > > if (i == 0) break; > > > a[i] = i+1; > > > i = i-1; > > > } > > > > > > > > > Now, the above is the simplest case and is pure tail recursion. We > > > can eliminate this currently with the TailRecursionElimination pass. > > > The more complex case is where we have non-tail recusion and live > > > variables. > > > > > > > > > void ex2(i) { > > > if (i == 0) return; > > > j = a[i]; > > > ex2(i-1) > > > b[j] = i; > > > } > > > > > > > > > This requires a stack being modelled and could be converted to: > > > > > > > > > cnt = 0; > > > while (1) { > > > if (i == 0) break; > > > j = a[i]; > > > stack[cnt++] = j; > > > stack[cnt++] = i; > > > ++cnt; > > > --i; > > > } > > > while (cnt) { > > > i = stack[--cnt]; > > > j = stack[--cnt]; > > > b[j] = i; > > > } > > > > > > > > > The above is as far as the discussion got on IRC. The following are > > > my half-baked thoughts. > > > > > > > > > The above transform works because the access pattern to the stack is > > > strictly linear. Pushes are followed by pops. But I don't think this > > > is the most interesting case for recursion in real programs. I see > > > (qualitatively, poorly) 3 reasons for recursion in code that should > > > go fast: > > > > > > > > > 1. A fibonnacci function in a microbenchmark or compiler shootout. > > > I > > > think we can safely ignore this usecase... > > > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc. > > > 3. Traversing a data structure, where the recursion is simply to get > > > to the leaves which is where all the fun stuff happens. > > > > > > > > > Points 2 and 3 both share a trait that they're dividing/fanning out > > > at each step: > > > > > > > > > void ex3(i) { > > > if (is_base(i)) { > > > f(); > > > return; > > > } > > > g(i); > > > ex3(i / 2); > > > ex3(i / 2 + 1); > > > } > > > > > > > > > The access pattern of such a function is not linear. It is a > > > pre-order walk of a (binary, in this case) tree. As such we can't > > > use the two-loop transform above, but would have to form one loop > > > and just manually implement the stack operations: > > > > > > > > > cnt = 0 > > > stack[cnt++] = i; > > > while (1) { > > > i = stack[--cnt]; > > > if (is_base(i)) { > > > f(); continue; > > > } > > > g(i); > > > stack[cnt++] = i / 2 + 1; > > > stack[cnt++] = i / 2; > > > } > > > > > > > > > OK, it's still a well-formed loop, we can still do useful operations > > > on it like LICM or, if it's profitable, unrolling. > > > > > > > > > Now what about a more complex case, like in the FFT butterfly > > > algorithm: > > > > > > > > > void ex4(i) { > > > if (is_base(i)) { > > > f(i); return; > > > } > > > g(i); > > > ex4(i / 2); > > > ex4(i / 2 + 1); > > > h(i); > > > } > > > > > > > > > Here, we have to do work after the last recursive call. This > > > presents a problem, because we don't have any of the caller's > > > context when dealing with a node. Is a node N the first recursive > > > call? or the second? Call-stack recursion can save two things: the > > > state of variables and the return address which serves as another > > > kind of state. The obvious answer is to turn this into a sort of > > > state > > > machine: > > > > > > > > > cnt = 0 > > > stack[cnt++] = i; > > > stack[cnt++] = STATE_1; > > > while (1) { > > > state = stack[--cnt]; > > > i = stack[--cnt]; > > > > > > > > > if (state == STATE_FINISH) { > > > h(i); > > > continue; > > > } > > > > > > > > > if (is_base(i)) { > > > f(); continue; > > > } > > > > > > > > > g(i); > > > stack[cnt++] = i; // Push ourself again, in the final state this > > > time stack[cnt++] = STATE_FINISH; stack[cnt++] = i / 2 + 1; > > > stack[cnt++] = STATE_1; stack[cnt++] = i / 2; stack[cnt++] > > > STATE_1; } > > > > As we had briefly mentioned on IRC, one way of forming this 'stack', > > and its associated 'cnt' variable, is to use dynamic stack allocation. > > You can do this in general at the IR level (using allocas that just > > happen not to be in the entry block). If each stack block holds a > > pointer to the previous one (as a linked list), then you just need to > > maintain a pointer to the current block, and you don't need a separate > > 'cnt' variable (you might also keep a pointer to the next block for > > allocation reuse). Is that what you had in mind? > > > > -Hal > > > > > > > > > > > This solution can be generalised for if there is code in between the > > > recursive calls (just add more states). It's getting more complex in > > > terms of control flow though - the question is, is it worth it? > > > > > > > > > Unrolling this loop is probably not going to usefully create larger > > > basic blocks due to how complex the control flow is. I suppose we > > > have the opportunity, if we unroll, of CSE and jump threading values > > > through equivalent blocks that may gain us time. However, we can > > > still do LICM which is probably a major improvement, and there are > > > no calls/returns to speculate. > > > > > > > > > This approach would have to be extended if the recursive calls > > > didn't return void (for example it was a recursive reduction), but > > > it's not impossible. > > > > > > > > > Sorry for the long post. What do people think? > > > > > > > > > Cheers, > > > > > > > > > James > > > > > > > > > [1]: http://link.springer.com/ chapter/10.1007/3-540-45574-4_ 3 > > > > -- > > Hal Finkel > > Assistant Computational Scientist > > Leadership Computing Facility > > Argonne National Laboratory > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory > > > -- IMPORTANT NOTICE: The contents of this email and any attachments are > confidential and may also be privileged. If you are not the intended > recipient, please notify the sender immediately and do not disclose the > contents to any other person, use it for any purpose, or store or copy the > information in any medium. Thank you. > > ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, > Registered in England & Wales, Company No: 2557590 > ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, > Registered in England & Wales, Company No: 2548782 > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150220/b11206e4/attachment.html>
Chandler, thank you very much for working on this. My manual implementation was a) using an extra unnecessary push+pop, and b) resulting in a very twisted and unoptimizable control flow. The one by Jeremy Lakeman is better. Let me know if I can be of help with benchmarking/tuning for ARM architectures or anything else. From: Chandler Carruth [mailto:chandlerc at google.com] Sent: 18 February 2015 21:01 To: Pablo Barrio Cc: Hal Finkel; James Molloy; llvmdev at cs.uiuc.edu Subject: Re: RFC: Recursive inlining On Wed, Feb 18, 2015 at 12:57 PM, Pablo Barrio <Pablo.Barrio at arm.com> wrote: Also, virtually all processors beyond a certain level of sophistication (let's say, meant to run an OS) feature a return-stack prediction mechanism. This results in highly-efficient branch prediction for calls+returns. Since we are transforming the "call jumps" into "switch jumps", we are probably misusing that piece of hardware. Yes, it will be critical to use a branching pattern that similarly fits modern predictors. I don't think this will be hard though. Again, I'm continuing to work on a prototype actual pass. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150220/09546b8f/attachment.html>
Thank you Jeremy, I compared the CFG after optimizations for my version against yours, and mine just looks atrocious, although I’m not sure why. From: jeremy at lakeman.me [mailto:jeremy at lakeman.me] On Behalf Of Jeremy Lakeman Sent: 20 February 2015 06:40 To: Pablo Barrio Cc: Hal Finkel; James Molloy; llvmdev at cs.uiuc.edu Subject: Re: [LLVMdev] RFC: Recursive inlining The manual transformation of this form seems to work well enough for me; unsigned fibonacci(unsigned n){ if (n <= 1) return n; return fibonacci(n - 1) + fibonacci(n - 2); } time fib(40) real 0m1.100s user 0m1.096s sys 0m0.004s after; unsigned fibonacci(unsigned argn){ unsigned stack[1024]; unsigned pos=0; unsigned n=argn; unsigned ret; unsigned tmp1; stack[pos++]=0; start: if (n <= 1){ ret=n; goto pop; } stack[pos++]=n; // push n as it's live stack[pos++]=1; // push state n= n - 1; // set call argument goto start; pop: // pop state and jump switch(stack[--pos]){ case 0: return ret; case 1: n=stack[--pos]; // pop n stack[pos++]=ret; // push ret as it's live stack[pos++]=2; // push state n= n - 2; // set call argument goto start; case 2: tmp1 = stack[--pos]; // pop temporary return value ret = tmp1 + ret; goto pop; } } time fib(40) real 0m0.772s user 0m0.772s sys 0m0.003s But with your 3 return example the stack based code is ending up slower after optimisations; ... if (n == 0){ ret=0; goto pop; } if (n == 1){ ret=1; goto pop; } ... time fib(40) real 0m1.233s user 0m1.233s sys 0m0.000s On Thu, Feb 19, 2015 at 7:27 AM, Pablo Barrio <Pablo.Barrio at arm.com<mailto:Pablo.Barrio at arm.com>> wrote: Hi, Apologies for the very late response. We have manually tried the idea with a very simple Fibonacci sequence code. While being very very simple, the recursion cannot be handled by TRE. Because there are two recursive callsites, it also needs to keep some sort of state across iterations of the "while(stack not empty)" loop. We get between 2.5 and 8x slowdowns depending on which processor, and the branch mispredictions are huge (specially in x86: around 100 times higher). We are replacing the function call mechanism by a stack + switch statement to mimic the control flow for callsites. My understanding is that the branches in the switch are hardly predictable: they depend on a value taken from the stack at each iteration. Things could get harder as we increase the number of recursive callsites in a function (we increase the size of the switch). Also, virtually all processors beyond a certain level of sophistication (let's say, meant to run an OS) feature a return-stack prediction mechanism. This results in highly-efficient branch prediction for calls+returns. Since we are transforming the "call jumps" into "switch jumps", we are probably misusing that piece of hardware. Of course, this is not to say that this is a dead end. I paste the code here (original and "unrolled" version) so that you see the problem, and maybe you can spot something that I missed. Put your favorite queue implementation in there. Cheers, Pablo unsigned fibonacci(unsigned n){ if (n == 0) return 0; else if (n == 1) return 1; else return fibonacci(n - 1) + fibonacci(n - 2); } unsigned fibonacci_opt(unsigned n){ // S0 = before 1st call, S1 = after 1st call, S2 = after 2nd call enum {S0, S1, S2}; Stack s; int ret; init(&s, 256); push(&s, n); push(&s, S0); while (!empty(s)){ /* print(&s); */ int state = pop(&s); n = pop(&s); int a, b; switch(state){ case S0: if (n <= 0) ret = 0; else if (n == 1) ret = 1; else{ // On "fake return" we need to move to the end of the first "recursive call". push(&s, n); push(&s, S1); // Go for the first "recursive call" push(&s, n - 1); push(&s, S0); } break; case S1: // On "fake return", we want to recover the return of the previous call. push(&s, ret); push(&s, S2); // Go for the second "recursive call" push(&s, n - 2); push(&s, S0); break; case S2: ret += n; } } delete(&s); return ret; } -----Original Message----- From: Hal Finkel [mailto:hfinkel at anl.gov<mailto:hfinkel at anl.gov>] Sent: 06 February 2015 02:35 To: James Molloy Cc: llvmdev at cs.uiuc.edu<mailto:llvmdev at cs.uiuc.edu>; Chandler Carruth; Pablo Barrio Subject: Re: RFC: Recursive inlining ----- Original Message -----> From: "James Molloy" <james at jamesmolloy.co.uk<mailto:james at jamesmolloy.co.uk>> > To: "Hal Finkel" <hfinkel at anl.gov<mailto:hfinkel at anl.gov>> > Cc: llvmdev at cs.uiuc.edu<mailto:llvmdev at cs.uiuc.edu>, "Chandler Carruth" <chandlerc at google.com<mailto:chandlerc at google.com>>, > "pablo barrio" <pablo.barrio at arm.com<mailto:pablo.barrio at arm.com>> > Sent: Thursday, February 5, 2015 9:18:35 AM > Subject: Re: RFC: Recursive inlining > > > Hi Hal, > > > As we had briefly mentioned on IRC, one way of forming this 'stack', > > and its associated 'cnt' variable, is to use dynamic stack > > allocation. > > > > I hadn't really reached a decision on the mechanics yet. However, your > suggestion while it can work within the current abilities of the IR, > has the disadvantage that it is using an extra stack slot for the link > pointer. I think if we are going to apply this optimization, we must > make sure we aren't going to blow the stack. > > > As users may already be just close enough to the max-stack boundary > that their applications don't crash, I think we should aim as much as > possible to keep the stack usage no higher than the recursive > equivalent. Holding the "state" is the equivalent of a return address, > so adding a stack link pointer would be the equivalent of a frame > address I suppose... That could be acceptable? >I agree. We should aim not to increase the amount of stack space used. This will require some target knowledge (because dynamic stack allocation has some target-dependent overhead because of alignment and other ABI requirements). -Hal> > As I say, I haven't thought about the deep mechanics just yet. > > > James > > On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < hfinkel at anl.gov<mailto:hfinkel at anl.gov> > > wrote: > > > ----- Original Message ----- > > From: "James Molloy" < james at jamesmolloy.co.uk<mailto:james at jamesmolloy.co.uk> > > > To: llvmdev at cs.uiuc.edu<mailto:llvmdev at cs.uiuc.edu> , "Chandler Carruth" < chandlerc at google.com<mailto:chandlerc at google.com> > > >, "Hal Finkel" < hfinkel at anl.gov<mailto:hfinkel at anl.gov> >, "pablo barrio" > > < pablo.barrio at arm.com<mailto:pablo.barrio at arm.com> > > > Sent: Thursday, February 5, 2015 4:36:16 AM > > Subject: RFC: Recursive inlining > > > > > > Hi, > > > > > > The other day on IRC myself, Chandler, Hal and Pablo discussed > > recursive inlining. I've pondered a bit since then, and thought I'd > > get some background on the list with the aim of stimulating > > discussion. > > > > > > Our goal is to inline recursive functions. We have several workloads > > with recursive calls, and inlining them to a certain degree can > > improve performance by up to 30%. > > > > > > Inlining recursive functions is a slightly different problem to > > inlining non-recursive functions in that the inlining can never > > terminate, so you need to choose a factor (number of times) to > > inline. In this respect it is similar to loop unrolling, and > > Chandler suggested we should model it as such. > > > > > > We originally thought he meant by this that the mechanics of > > inlining would stay the same, but the heuristics would have to be > > changed. > > This is the approach taken by Rugina & Rinard [1], which is almost > > the sum of the literature on this topic I could find, which > > surprised me. However, the actual suggestion was to remove the > > recursion entirely, and replace it with a loop. > > > > > > Before I get into the mechanics of this, I'd like to explain what > > the possible optimization opportunities are for recursive functions > > turned into a loop. > > > > > > 1. Unrolling > > - This leads to bigger basic blocks (if they can be merged) and > > exposes more pipeline-level parallelism. > > - CSE between iterations. > > - Reduces backbranch overhead (generally negligible impact). > > 2. Commoning / LICM > > - Recursion doesn't have a vehicle for moving common code out into a > > dominating block (compare with LICM for loops). > > > > > > OK, first example: > > > > > > void ex1(i) { > > if (i == 0) return; > > a[i] = i+1; > > ex1(i-1); > > } > > > > > > gets converted to: > > > > > > while (1) { > > if (i == 0) break; > > a[i] = i+1; > > i = i-1; > > } > > > > > > Now, the above is the simplest case and is pure tail recursion. We > > can eliminate this currently with the TailRecursionElimination pass. > > The more complex case is where we have non-tail recusion and live > > variables. > > > > > > void ex2(i) { > > if (i == 0) return; > > j = a[i]; > > ex2(i-1) > > b[j] = i; > > } > > > > > > This requires a stack being modelled and could be converted to: > > > > > > cnt = 0; > > while (1) { > > if (i == 0) break; > > j = a[i]; > > stack[cnt++] = j; > > stack[cnt++] = i; > > ++cnt; > > --i; > > } > > while (cnt) { > > i = stack[--cnt]; > > j = stack[--cnt]; > > b[j] = i; > > } > > > > > > The above is as far as the discussion got on IRC. The following are > > my half-baked thoughts. > > > > > > The above transform works because the access pattern to the stack is > > strictly linear. Pushes are followed by pops. But I don't think this > > is the most interesting case for recursion in real programs. I see > > (qualitatively, poorly) 3 reasons for recursion in code that should > > go fast: > > > > > > 1. A fibonnacci function in a microbenchmark or compiler shootout. > > I > > think we can safely ignore this usecase... > > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc. > > 3. Traversing a data structure, where the recursion is simply to get > > to the leaves which is where all the fun stuff happens. > > > > > > Points 2 and 3 both share a trait that they're dividing/fanning out > > at each step: > > > > > > void ex3(i) { > > if (is_base(i)) { > > f(); > > return; > > } > > g(i); > > ex3(i / 2); > > ex3(i / 2 + 1); > > } > > > > > > The access pattern of such a function is not linear. It is a > > pre-order walk of a (binary, in this case) tree. As such we can't > > use the two-loop transform above, but would have to form one loop > > and just manually implement the stack operations: > > > > > > cnt = 0 > > stack[cnt++] = i; > > while (1) { > > i = stack[--cnt]; > > if (is_base(i)) { > > f(); continue; > > } > > g(i); > > stack[cnt++] = i / 2 + 1; > > stack[cnt++] = i / 2; > > } > > > > > > OK, it's still a well-formed loop, we can still do useful operations > > on it like LICM or, if it's profitable, unrolling. > > > > > > Now what about a more complex case, like in the FFT butterfly > > algorithm: > > > > > > void ex4(i) { > > if (is_base(i)) { > > f(i); return; > > } > > g(i); > > ex4(i / 2); > > ex4(i / 2 + 1); > > h(i); > > } > > > > > > Here, we have to do work after the last recursive call. This > > presents a problem, because we don't have any of the caller's > > context when dealing with a node. Is a node N the first recursive > > call? or the second? Call-stack recursion can save two things: the > > state of variables and the return address which serves as another > > kind of state. The obvious answer is to turn this into a sort of > > state > > machine: > > > > > > cnt = 0 > > stack[cnt++] = i; > > stack[cnt++] = STATE_1; > > while (1) { > > state = stack[--cnt]; > > i = stack[--cnt]; > > > > > > if (state == STATE_FINISH) { > > h(i); > > continue; > > } > > > > > > if (is_base(i)) { > > f(); continue; > > } > > > > > > g(i); > > stack[cnt++] = i; // Push ourself again, in the final state this > > time stack[cnt++] = STATE_FINISH; stack[cnt++] = i / 2 + 1; > > stack[cnt++] = STATE_1; stack[cnt++] = i / 2; stack[cnt++] > > STATE_1; } > > As we had briefly mentioned on IRC, one way of forming this 'stack', > and its associated 'cnt' variable, is to use dynamic stack allocation. > You can do this in general at the IR level (using allocas that just > happen not to be in the entry block). If each stack block holds a > pointer to the previous one (as a linked list), then you just need to > maintain a pointer to the current block, and you don't need a separate > 'cnt' variable (you might also keep a pointer to the next block for > allocation reuse). Is that what you had in mind? > > -Hal > > > > > > > This solution can be generalised for if there is code in between the > > recursive calls (just add more states). It's getting more complex in > > terms of control flow though - the question is, is it worth it? > > > > > > Unrolling this loop is probably not going to usefully create larger > > basic blocks due to how complex the control flow is. I suppose we > > have the opportunity, if we unroll, of CSE and jump threading values > > through equivalent blocks that may gain us time. However, we can > > still do LICM which is probably a major improvement, and there are > > no calls/returns to speculate. > > > > > > This approach would have to be extended if the recursive calls > > didn't return void (for example it was a recursive reduction), but > > it's not impossible. > > > > > > Sorry for the long post. What do people think? > > > > > > Cheers, > > > > > > James > > > > > > [1]: http://link.springer.com/ chapter/10.1007/3-540-45574-4_ 3 > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590 ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782 _______________________________________________ LLVM Developers mailing list LLVMdev at cs.uiuc.edu<mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590 ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782 -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150220/215ae3ca/attachment.html>