Max Bolingbroke
2010-Nov-05 12:56 UTC
[LLVMdev] Hoisting elements of array argument into registers
Hi, The Glasgow Haskell Compiler LLVM backend is generating a lot of code which appears to be optimised by LLVM quite poorly. The problem is demonstrated by this C source code: int wf(int sp[]) { if (sp[0] == 0) { return sp[2] + sp[3]; } else { sp[3] = sp[3] + (sp[1] * 5); sp[2] = (sp[2] + sp[0]) + 1; sp[1] = sp[1] - 1; sp[0] = sp[0] - 1; return wf(sp); } } int g(int a) { int sp[] = { a, a, 0, 0 }; return wf(sp); } int main(void) { printf("%d\n", g(5)); return 0; } As you can see, wf takes its arguments via a pointer. Nonetheless, we would do like to loop optimisations on wf -- in particular, we would like to eliminate the duplicate computations in sp[0] and sp[1] (GHC's own optimiser does not do any classical loop optimisations like this). Unfortunately, LLVM (opt -O3: I tried v2.7 and HEAD) doesn't seem to get very far. It eliminates the tail call for wf() correctly, but the arguments are carried between each iteration of the loop by storing intothe sp array. It would be better if the arguments were kept in registers during the loop and only stored into the array upon exit. (These stores could then be eliminated easily since there is no later load from sp). If I manually rewrite the code to replace the array argument with its composite scalars in the following manner LLVM obviously does the right thing: int wf(int a, int b, int c, int d) { if (a == 0) { return c + d; } else { return wf(a - 1, b - 1, (c + a) + 1, d + (b * 5)); } } int g(int a) { return wf(a, a, 0, 0); } int main(void) { printf("%d\n", g(5)); return 0; } The question is, is there any way to get LLVM to do this rewriting for us? It seems like this would be a generally useful optimisation. I guess you could also achieve this by sinking the stores to sp into next loop iteration/the loop exit, where they will meet the corresponding loads and so good things will happen. The current instruction sinking pass doesn't seem to get this case, though. Thanks in advance for any help! Max p.s. Our real code is a little more complicated because sp is really a pointer to the top of the Haskell runtimes stack (which is of unknown length), not a 4-element array, which might make the optimisation a bit harder. Perhaps LLVM would have to promote just those elements of the input array which were accessed in the loop into scalars.
Duncan Sands
2010-Nov-05 14:51 UTC
[LLVMdev] Hoisting elements of array argument into registers
Hi Max,> The Glasgow Haskell Compiler LLVM backend is generating a lot of code which > appears to be optimised by LLVM quite poorly. The problem is demonstrated by > this C source code:if I run this at -O3 under llvm-gcc from top-of-tree on x86-64 linux then (1) it computes that g(5) is equal to 95, and main is transformed to just print 95. (2) g is transformed into code with no loop, i.e. the result is computed directly from the value of "a" in straight line code (there is a special case when a is zero). (3) wf is not transformed as well: values are loaded from memory and stored back on each loop. And there is still a loop. I see the same with clang. I'm not sure why the optimizers do so much better when they can see that sp is a local array (the special initial values don't matter). Ciao, Duncan.> > int wf(int sp[]) { > if (sp[0] == 0) { > return sp[2] + sp[3]; > } else { > sp[3] = sp[3] + (sp[1] * 5); > sp[2] = (sp[2] + sp[0]) + 1; > sp[1] = sp[1] - 1; > sp[0] = sp[0] - 1; > return wf(sp); > } > } > > int g(int a) { > int sp[] = { a, a, 0, 0 }; > return wf(sp); > } > > int main(void) { > printf("%d\n", g(5)); > return 0; > } > > As you can see, wf takes its arguments via a pointer. Nonetheless, we would do > like to loop optimisations on wf -- in particular, we would like to eliminate > the duplicate computations in sp[0] and sp[1] (GHC's own optimiser does not do > any classical loop optimisations like this). > > Unfortunately, LLVM (opt -O3: I tried v2.7 and HEAD) doesn't seem to get very > far. It eliminates the tail call for wf() correctly, but the arguments are > carried between each iteration of the loop by storing intothe sp array. It > would be better if the arguments were kept in registers during the loop and > only stored into the array upon exit. (These stores could then be eliminated > easily since there is no later load from sp). > > If I manually rewrite the code to replace the array argument with its composite > scalars in the following manner LLVM obviously does the right thing: > > int wf(int a, int b, int c, int d) { > if (a == 0) { > return c + d; > } else { > return wf(a - 1, b - 1, (c + a) + 1, d + (b * 5)); > } > } > > int g(int a) { > return wf(a, a, 0, 0); > } > > int main(void) { > printf("%d\n", g(5)); > return 0; > } > > The question is, is there any way to get LLVM to do this rewriting for us? It > seems like this would be a generally useful optimisation. > > I guess you could also achieve this by sinking the stores to sp into next loop > iteration/the loop exit, where they will meet the corresponding loads and so > good things will happen. The current instruction sinking pass doesn't seem to > get this case, though. > > Thanks in advance for any help! > Max > > p.s. Our real code is a little more complicated because sp is really a pointer > to the top of the Haskell runtimes stack (which is of unknown length), not a > 4-element array, which might make the optimisation a bit harder. Perhaps LLVM > would have to promote just those elements of the input array which were > accessed in the loop into scalars. > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Duncan Sands
2010-Nov-05 16:51 UTC
[LLVMdev] Hoisting elements of array argument into registers
> I see the same with clang. I'm not sure why the optimizers do so much better > when they can see that sp is a local array (the special initial values don't > matter).It is the scalar replacement of aggregates pass that puts everything into registers when sp is a local array. What happens is: the tail recursion in wf is eliminated. wf is inlined into g. scalarrepl turns the (local) array accesses in g into registers. Once everything is in registers, later passes clean everything up. Ciao, Duncan.
Possibly Parallel Threads
- [LLVMdev] Hoisting elements of array argument into registers
- [LLVMdev] Hoisting elements of array argument into registers
- [LLVMdev] Hoisting elements of array argument into registers
- [LLVMdev] Hoisting elements of array argument into registers
- [LLVMdev] Hoisting elements of array argument into registers