John Mosby
2009-Mar-01 22:57 UTC
[LLVMdev] Shrink Wrapping - RFC and initial implementation
First, thanks very much for your comments! On Sat, Feb 28, 2009 at 8:05 PM, Evan Cheng <evan.cheng at apple.com> wrote:> > On Feb 26, 2009, at 2:02 PM, John Mosby wrote: > > It is limited to X86 presently since that is the only target I have > > access to at the moment. > > What part of this is target dependent? Is this due to emitPrologue / > emitEpilogue being target specific? >It is target dependent (X86) at present because of the way I developed it, just using the X86 target since that is the only one on which I can test the entire (static) flow: test.c -> llvm-gcc -emit-llvm -> (.ll, .bc) -> llc --shrink-wrap -> .s -> gcc test.s -o test. I worked with other targets also, but I decided to take it as far as I could on the first go with X86. First pass was without debugging info and with simple stack frames, in the interest of getting as much worked out as possible. I saw the issue concerning how code gen handles placement of spill and restore code outside of entry/return blocks before I had the first test cases running, but I worked through the details using -march=x86 only. Re: debugging info: I know about the work to change the way debugging info is handled, so I held off trying to make the shrink wrapping work with the current impl.> > The main features are: > > - Placing callee saved register (CSR) spills and restores in the > > CFG to tightly surround uses > > so that execution paths that do not use CSRs do not pay the > > spill/restore penalty. > > > > - Avoiding placment of spills/restores in loops: if a CSR is used > > inside a loop(nest), the spills > > are placed in the loop preheader, and restores are placed in > > the loop exit nodes (the > > successors of the loop _exiting_ nodes). > > > > - Covering paths without CSR uses: e.g. if a restore is placed in > > a join block, a matching spill > > is added to the end of all immediate predecessor blocks that > > are not reached by a spill. > > Similarly for saves placed in branch blocks. > > Sounds great. It would help everyone if you can show some examples code.I am putting documented examples together from the test cases in the patch.> Since I ran into a non-trivial issue in developing this pass, I > > would like to submit my implementation as a "RFC + code/tests" > > rather than a typical contribution, and get people's opinions on how > > to proceed. > > > > The issue is that the code generator assumes all spills and restores > > of callee saved registers must be placed in the entry and return > > blocks resp. > > Shink wrapping violates this assumption, with the result that the > > frame offsets computed by PEI for frame-relative operands may be > > incorrect. > > I am not sure how this would happen. Why would frame offsets be > affected by where these instructions are placed?The issue is illustrated by a simple example in which a single CSR is used in one branch of a conditional. When the stack frame is laid out, the spill for this CSR is accounted for in the calculation of stack size as it should be. The stack slot for the CSR is correctly computed and everything seems fine when the MO_FrameIndex are replaced. The problem is that since the spill instruction for the CSR (e.g. pushl %esi) is moved from the entry block, the push does not happen, and the value of %esp in the entry block is not what it should be to satisfy the offsets produced by eliminateFrameIndex(). A similar situation exists for the BB into which a spill is "moved" (from the entry block): a push happens to spill the CSR on entry to the block, and now %esp is not what it should be for that block. The example below illustrates this issue: assume: int F(int a, int b, int c) uses one CSR in conditional branch prologue, no shrink wrapping: _F: pushl %esi # spill CSR %csi, %esp -= 4 (in this case) subl $56, %esp # create frame, %esp = %esp - 56 movl 64(%esp), %eax # fetch arg 'a' from entry %esp + 4 movl %eax, 52(%esp) movl 68(%esp), %eax # fetch arg 'b' movl %eax, 48(%esp) ... prologue with spill shrink-wrapped to basic block bb: _F: # no spill of %esi, moved to bb subl $56, %esp # create frame same as before %esp = %esp - 56 movl 64(%esp), %eax # error: 'a' is not at 64(%esp), it's at 60(%esp) movl %eax, 52(%esp) ... The simple, ugly hack of adjusting the value by which %esp is decremented in the prologue when one or more CSR spills have been placed into other blocks takes care of the issue on this simple code (no dynamic stack realign., nor EH) on x86. The companion hack for (non entry) MBBs into which spills have been introduced is to adjust the stack size around eliminateFrameIndex()'s for replacement of MO_FrameIndex operands. Obviously, all of this applies only when spills are done with push/pop, which is the case on x86. I used this issue to start looking at generalizing how spills and restores are handled, before looking too closely at other targets, and developed the workaround for the initial implementation.> > > > My limited implementation uses a workaround that adjusts the > > generation of prologue code and the frame indices used by > > the target eliminateFrameIndex() when necessary. I am looking at > > several approaches, but I would like input from anyone who > > has an opinion. > > > > I think to do this right for every target is a big job. I'd like to > see prologue / epilogue related stuff be moved out of > TargetRegisterInfo. Shrink wrapping will only happen when the targets > buy-in, i.e. providing the right hooks. >Part of what I'm doing now is estimating the work, which requires going through the targets. I am not far enough along to send out a proposal. Moving pro/epi generation out of TRI, perhaps into its own "component" is one architecture I am looking at.> When is shrink wrapping happening? Is it possible to do it after CSR > spills and restores are inserted but before FI are lowered into sp / > fp +/- offset?Shrink wrapping starts after calculateCalleeSavedRegisters(), which creates the list of CSRs used in the function. Shrink wrapping assigns MBB placements for spills and restores based on where they are used. calculateCalleeSavedRegisters() determines stack slots for the CSRs used in the function. I don't see an interaction between this and shrink wrapping, of have I missed something?> > Finally, I realize that shrink wrapping is probably not high > > priority in the larger scheme of LLVM development, so I'm not > > expecting > > a huge response, but any ideas are certainly welcome. > > It's actually a fairly useful optimization. It can really help a class > of functions, e.g. functions with early returns. >Quite right, it is certainly worthwhile. I could have left that comment out :-)> > > > The patch and a test tarball are attached. I include basic tests > > that are run with the supplied Makefile. > > Some comments: > > 1. The code needs some refactoring. :-) It's a big chunk of code so > it's hard to digest. > 2. There doesn't seem to be a way to turn off shrink wrapping. Please > make sure it's optional. When it's not being done, PEI should not > require dominator, etc. >I already refactored once, but I knew it would not be enough(!), I'll definitely do another pass. I forgot to put the analysis deps under --shrink-wrap, I will fix that and anything else that I might have left out of the option.> > From what I can see this is a very good first step. I look forward to > seeing its completion. > > Evan >Thanks! Likewise, and it's a pleasure to work on. John -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20090301/70df295e/attachment.html>
Evan Cheng
2009-Mar-02 17:35 UTC
[LLVMdev] Shrink Wrapping - RFC and initial implementation
On Mar 1, 2009, at 2:57 PM, John Mosby wrote:> First, thanks very much for your comments! > > On Sat, Feb 28, 2009 at 8:05 PM, Evan Cheng <evan.cheng at apple.com> > wrote: > > On Feb 26, 2009, at 2:02 PM, John Mosby wrote: > > It is limited to X86 presently since that is the only target I have > > access to at the moment. > > What part of this is target dependent? Is this due to emitPrologue / > emitEpilogue being target specific? > > It is target dependent (X86) at present because of the way I > developed it, just using the X86 target since that is the only one > on which I can test the entire (static) flow: test.c -> llvm-gcc - > emit-llvm -> (.ll, .bc) -> llc --shrink-wrap -> .s -> gcc test.s -o > test. > > I worked with other targets also, but I decided to take it as far as > I could on the first go with X86. > > First pass was without debugging info and with simple stack frames, > in the interest of getting as much worked out as possible. > I saw the issue concerning how code gen handles placement of spill > and restore code outside of entry/return blocks before I had the > first test cases running, but I worked through the details using - > march=x86 only.Ok.> > Re: debugging info: I know about the work to change the way > debugging info is handled, so I held off trying to make the shrink > wrapping work with the current impl. > > > > The main features are: > > - Placing callee saved register (CSR) spills and restores in the > > CFG to tightly surround uses > > so that execution paths that do not use CSRs do not pay the > > spill/restore penalty. > > > > - Avoiding placment of spills/restores in loops: if a CSR is used > > inside a loop(nest), the spills > > are placed in the loop preheader, and restores are placed in > > the loop exit nodes (the > > successors of the loop _exiting_ nodes). > > > > - Covering paths without CSR uses: e.g. if a restore is placed in > > a join block, a matching spill > > is added to the end of all immediate predecessor blocks that > > are not reached by a spill. > > Similarly for saves placed in branch blocks. > > Sounds great. It would help everyone if you can show some examples > code. > > I am putting documented examples together from the test cases in the > patch.Yes, please. I think that will help to encourage discussion.> > > Since I ran into a non-trivial issue in developing this pass, I > > would like to submit my implementation as a "RFC + code/tests" > > rather than a typical contribution, and get people's opinions on how > > to proceed. > > > > The issue is that the code generator assumes all spills and restores > > of callee saved registers must be placed in the entry and return > > blocks resp. > > Shink wrapping violates this assumption, with the result that the > > frame offsets computed by PEI for frame-relative operands may be > > incorrect. > > I am not sure how this would happen. Why would frame offsets be > affected by where these instructions are placed? > > The issue is illustrated by a simple example in which a single CSR > is used in one branch of a conditional. When the stack frame is laid > out, the spill for this CSR is accounted for in the calculation of > stack size as it should be. The stack slot for the CSR is correctly > computed and everything seems fine when the MO_FrameIndex are > replaced. The problem is that since the spill instruction for the > CSR (e.g. pushl %esi) is moved from the entry block, the push does > not happen, and the value of %esp in the entry block is not what it > should be to satisfy the offsets produced by eliminateFrameIndex(). > A similar situation exists for the BB into which a spill is > "moved" (from the entry block): a push happens to spill the CSR on > entry to the block, and now %esp is not what it should be for that > block. The example below illustrates this issue:Ok. Then PEI should use store / load instead of push and pop and esp adjustment should be adjusted accordingly.> > assume: > int F(int a, int b, int c) uses one CSR in conditional branch > > prologue, no shrink wrapping: > > _F: > pushl %esi # spill CSR %csi, %esp -= 4 (in this > case) > subl $56, %esp # create frame, %esp = %esp - 56 > movl 64(%esp), %eax # fetch arg 'a' from entry %esp + 4 > movl %eax, 52(%esp) > movl 68(%esp), %eax # fetch arg 'b' > movl %eax, 48(%esp) > ... > > prologue with spill shrink-wrapped to basic block bb: > > _F: > # no spill of %esi, moved to bb > subl $56, %esp # create frame same as before %esp = %esp - > 56 > movl 64(%esp), %eax # error: 'a' is not at 64(%esp), it's at > 60(%esp) > movl %eax, 52(%esp) > ... > > The simple, ugly hack of adjusting the value by which %esp is > decremented in the prologue when one or more CSR spills have been > placed into other blocks takes care of the issue on this simple code > (no dynamic stack realign., nor EH) on x86. > > The companion hack for (non entry) MBBs into which spills have been > introduced is to adjust the stack size around > eliminateFrameIndex()'s for replacement of MO_FrameIndex operands. > > Obviously, all of this applies only when spills are done with push/ > pop, which is the case on x86. I used this issue to start looking at > generalizing how spills and restores are handled, before looking too > closely at other targets, and developed the workaround for the > initial implementation.Spills don't have to be done with push and pop. One possible approach to take is to perform them with store and load. Then later on PEI will optimize (some of) them into push and pop.> > > > > > My limited implementation uses a workaround that adjusts the > > generation of prologue code and the frame indices used by > > the target eliminateFrameIndex() when necessary. I am looking at > > several approaches, but I would like input from anyone who > > has an opinion. > > > > I think to do this right for every target is a big job. I'd like to > see prologue / epilogue related stuff be moved out of > TargetRegisterInfo. Shrink wrapping will only happen when the targets > buy-in, i.e. providing the right hooks. > > Part of what I'm doing now is estimating the work, which requires > going through the targets. I am not far enough along to send out a > proposal. Moving pro/epi generation out of TRI, perhaps into its own > "component" is one architecture I am looking at.It's not necessary to update all the targets at once. We can ask targets to opt in to take advantage of shrink wrapping.> > When is shrink wrapping happening? Is it possible to do it after CSR > spills and restores are inserted but before FI are lowered into sp / > fp +/- offset? > > Shrink wrapping starts after calculateCalleeSavedRegisters(), which > creates the list of CSRs used in the function. Shrink wrapping > assigns MBB placements for spills and restores based on where they > are used. calculateCalleeSavedRegisters() determines stack slots for > the CSRs used in the function. > I don't see an interaction between this and shrink wrapping, of have > I missed something?No.> > > > Finally, I realize that shrink wrapping is probably not high > > priority in the larger scheme of LLVM development, so I'm not > > expecting > > a huge response, but any ideas are certainly welcome. > > It's actually a fairly useful optimization. It can really help a class > of functions, e.g. functions with early returns. > > Quite right, it is certainly worthwhile. I could have left that > comment out :-) > > > > > > The patch and a test tarball are attached. I include basic tests > > that are run with the supplied Makefile. > > Some comments: > > 1. The code needs some refactoring. :-) It's a big chunk of code so > it's hard to digest. > 2. There doesn't seem to be a way to turn off shrink wrapping. Please > make sure it's optional. When it's not being done, PEI should not > require dominator, etc. > > I already refactored once, but I knew it would not be enough(!), > I'll definitely do another pass. I forgot to put the analysis deps > under --shrink-wrap, I will fix that and anything else that I might > have left out of the option.Ok, thanks.> > > From what I can see this is a very good first step. I look forward to > seeing its completion. > > Evan > > Thanks! Likewise, and it's a pleasure to work on.Good to hear. Thanks, Evan> > John > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20090302/95669dca/attachment.html>
John Mosby
2009-Mar-03 04:21 UTC
[LLVMdev] Shrink Wrapping - RFC and initial implementation
On Mon, Mar 2, 2009 at 10:35 AM, Evan Cheng <echeng at apple.com> wrote:> > On Mar 1, 2009, at 2:57 PM, John Mosby wrote: > > Obviously, all of this applies only when spills are done with push/pop, > which is the case on x86. I used this issue to start looking at generalizing > how spills and restores are handled, before looking too closely at other > targets, and developed the workaround for the initial implementation. > > > Spills don't have to be done with push and pop. One possible approach to > take is to perform them with store and load. Then later on PEI will optimize > (some of) them into push and pop. >I am rewriting to use stores/loads via storeRegToStackSlot(), loadRegFromStackSlot(), which will obviate the stack adjustment. I did it this way in an early version, but the architecture is to give the target a chance to generate the spill(restore) code itself, so I conformed to that. As you point out, these can selectively be replaced with push/pop later.> Part of what I'm doing now is estimating the work, which requires going > through the targets. I am not far enough along to send out a proposal. > Moving pro/epi generation out of TRI, perhaps into its own "component" is > one architecture I am looking at. > > > It's not necessary to update all the targets at once. We can ask targets to > opt in to take advantage of shrink wrapping. >That sounds good. I now have a better handle on the differences between the targets: XCore handles frame moves in its spillCalleeSavedRegister(), the ARM code for this is very straightforward. I hope to have an updated patch done tomorrow, with doc. and explicated examples. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20090302/aa079857/attachment.html>
Possibly Parallel Threads
- [LLVMdev] Shrink Wrapping - RFC and initial implementation
- [LLVMdev] Shrink Wrapping - RFC and initial implementation
- [LLVMdev] Shrink Wrapping - RFC and initial implementation
- [LLVMdev] Shrink Wrapping - RFC and initial implementation
- [LLVMdev] Basic block liveouts