Hi All!

I will be applying to the LLVM project for this GSoC, and I wanted a
preliminary sanity check on my project idea.

I intend to implement split (segmented) stacks for LLVM (like we have in
Go, and as is being implemented for GCC [1]). A lot of what follows is
lifted from [1]; I will progressively add more details as I get more
familiar with the LLVM codebase.

I intend to start with the simplest possible approach - representing the
stack as a doubly linked list of _block_s, the size of each _block_ being
a power of two. This can later be modified to improve performance and
accommodate other factors. Blocks will be chained together into a doubly
linked list structure (using the first two words in the block as the next
and previous pointers).

In the prologue, a function will check whether the current block has
enough stack space. This is easily done for functions which don't have
variable sized allocas, and for ones which do, we can assume some
worst-case upper bound. The prologue can then call an intrinsic (let's
call it llvm.adjust_stack) which allocates a new block (possibly by
delegating this to a user-provided callback), copies the arguments, saves
the previous stack pointer (in the new block), and adjusts the next and
previous pointers. It will also have to adjust the stack pointer, and the
frame pointer, if one is being maintained. Cleanup can be done by
hijacking the return address, as also mentioned in [1]. It might make
sense to leave the allocated blocks around, to avoid re-allocating the
next time the program needs more stack space.

DWARF info can be generated as follows: since we know the offset of the
base of the stack frame from the stack pointer (or we are maintaining a
frame pointer), we can always say whether the concerned call frame is the
first call frame in its block or not. If it is not, all the previous
register values can be computed as usual; if it is, we will add an extra
indirection, looking up the stack pointer saved in this block's header.

One thing I'd really like some input on is whether implementing split
stacks would be useful enough to warrant the effort (especially keeping
in mind that this is pretty useless on 64 bit architectures).

[1] http://gcc.gnu.org/wiki/SplitStacks
--
Sanjoy Das
http://playingwithpointers.com
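To pin down the data structure, a minimal C sketch of the block chain
described above (all names here - stack_block, grow_stack, BLOCK_SIZE -
are assumptions for illustration; the proposal fixes none of them):

    #include <stdlib.h>

    #define BLOCK_SIZE 4096  /* each block's size is a power of two */

    struct stack_block {
        struct stack_block *prev;  /* first word: previous block */
        struct stack_block *next;  /* second word: next block, or NULL */
        /* usable stack space follows the two link words */
    };

    /* Reuse an already-chained successor if one exists; otherwise
       allocate and link a fresh block.  Blocks are never freed,
       matching the "leave the allocated blocks around" policy. */
    static struct stack_block *grow_stack(struct stack_block *cur)
    {
        if (cur->next)
            return cur->next;
        struct stack_block *nb = aligned_alloc(BLOCK_SIZE, BLOCK_SIZE);
        if (!nb)
            abort();  /* out of memory: no stack to grow into */
        nb->prev = cur;
        nb->next = NULL;
        cur->next = nb;
        return nb;
    }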
On Wed, Mar 23, 2011 at 03:37:02PM +0530, Sanjoy Das wrote:

> I intend to start with the simplest possible approach - representing the
> stack as a doubly linked list of _block_s, the size of each _block_
> being a power of two. This can later be modified to improve performance
> and accommodate other factors. Blocks will be chained together into a
> doubly linked list structure (using the first two words in the block as
> the next and previous pointers).

Where do you plan to store the pointers? What changes to the runtime
environment does this require?

> In the prologue, a function will check whether the current block has
> enough stack space. This is easily done for functions which don't have
> variable sized allocas, and for ones which do, we can assume some
> worst-case upper bound.

If there are allocas involved, it is quite likely they are inside loops
etc., in which case there are no simple stack boundaries. This shouldn't
be a problem if the space for all statically sized variables is allocated
at the beginning.

> The prologue can then call an intrinsic (let's
> call it llvm.adjust_stack) which allocates a new block (possibly by
> delegating this to a user-provided callback), copies the arguments,
> saves the previous stack pointer (in the new block), and adjusts the
> next and previous pointers.

Why do you need to copy the arguments? In fact, why do you think you can
actually copy the arguments? Consider printf for this.

> It will also have to adjust the stack pointer, and the frame pointer,
> if one is being maintained. Cleanup can be done by hijacking the return
> address, as also mentioned in [1]. It might make sense to leave the
> allocated blocks around, to avoid re-allocating the next time the
> program needs more stack space.

Hijacking the return address is a nice trick. I'm not sure about freeing
/ not freeing the unused block; this again has implications for the
runtime environment.

Have you considered how much work call graph optimisations require?
Especially with LTO it should be possible to kill many of the checks by
computing used stack space across segments of the call graph. One of the
papers on service scalability discussed this for lightweight threads.
Forgot which one, it's been a while.

> One thing I'd really like some input on is whether implementing split
> stacks would be useful enough to warrant the effort (especially keeping
> in mind that this is pretty useless on 64 bit architectures).

I don't think it is useless on 64 bit architectures. You can't always
make arbitrarily large reservations of address space (e.g. for the
optimised output of Chicken), and such reservations don't come without a
price. Also consider something like a kernel environment, where you
really want to have a minimal default stack, but be able to fall back
gracefully.

Joerg
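To make the printf point concrete, a small standard-C illustration
(independent of the proposal; sum is an invented example): once va_start
has run, the va_list records where this activation's variadic arguments
live, so relocating the frame afterwards would leave it dangling.

    #include <stdarg.h>

    /* va_start records the location of the variadic argument area of
       *this* activation; copying the frame to a new block after that
       point would leave `ap` referring to the old block. */
    int sum(int n, ...)
    {
        va_list ap;
        int i, total = 0;
        va_start(ap, n);
        for (i = 0; i < n; i++)
            total += va_arg(ap, int);
        va_end(ap);
        return total;
    }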
Hi!

Thanks for the review. What follows is a revised version of the proposal.
I'll post this on Melange after getting some more feedback.

Segmented stacks for LLVM, GSoC '11
-----------------------------------

* Overview: Goal & Rationale

Support segmented (split) stacks in LLVM. The idea is to allocate stack
space to a thread incrementally, instead of allocating a (large)
worst-case block of memory. This is especially important on 32 bit
systems (where the virtual address space is limited). On 64 bit systems
(as Joerg pointed out), it will be useful in situations where the code
uses a continuation passing style (like Chicken Scheme) or in critical
software like kernels, where it may not actually be possible to assign
more than a certain amount of memory to the stack. In a situation where
stacks can be smaller than a single page, everyone wins.

This summer, I plan to implement this for the x86 (including x86-64)
platform, since that is the platform I'm most familiar with.

* Overall Design

The overall design is described below. More concrete details have been
put under the 'Implementation' section.

** The idea

The central idea is to represent the stack as a doubly-linked list of
memory blocks instead of a (large) contiguous chunk of memory. Each
stack frame will be made to fit entirely in one block, but one block may
(and, in general, will) contain more than one stack frame.

*** Block lifetime

Blocks, once allocated, are not de-allocated - since they are still a
part of the doubly-linked list (and hence not /lost/), there will be no
memory leak, and the blocks will be re-used the next time more stack
space is required.

** Issues with unwinding

Dividing up the stack into a list of smaller blocks violates one
important assumption: that stack addresses are contiguous across
function invocations. This may affect any runtime magic that involves
unwinding the stack, like exception handling and scanning the stack for
GC roots. However, assuming the unwinding routine uses the .eh_frame
section, unwinding should work fine, since we'll also emit a correct
.eh_frame. Edge cases (if they arise) will have to be eliminated on a
case by case basis.

* Implementation

This section describes the proposed implementation in more detail.

** The block

The block is a chunk of memory whose size is a power of two (in bytes).
It has four words reserved at the beginning:

1. previous - Pointing to the previous block.
2. next - Pointing to the next block.
3. saved_ip - The original return address back into the caller.
4. saved_sp - The value of the SP when this block was created.

** Changes to the function prologue

In its prologue, a function will check whether the current block has
enough stack space. This is easily done for functions which use a fixed
amount of stack space; for the ones which don't, we can assume some
worst-case upper bound. Since the block size is a constant power of two,
this check can be done using a bitwise and and an integer comparison. If
there isn't enough space in the current block, control jumps (rather
than calls) to a naked function, setup_new_block, described next.
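A C sketch of that prologue test, with the caveat that every name in it
is assumed; it relies only on the facts above - BLOCK_SIZE-aligned
blocks, a downward-growing stack, and the four reserved header words at
the low end of each block:

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE  4096u                   /* power of two */
    #define HEADER_SIZE (4u * sizeof(void *))   /* prev, next, saved_ip, saved_sp */

    /* sp & (BLOCK_SIZE - 1) is the distance from sp down to the start
       of the (aligned) block; the frame fits if that distance covers
       both the requested frame and the reserved header words.  (A
       full implementation would need care when sp sits exactly on a
       block boundary.) */
    static inline int frame_fits(uintptr_t sp, size_t frame_size)
    {
        return (sp & (BLOCK_SIZE - 1)) >= frame_size + HEADER_SIZE;
    }

Since frame_size + HEADER_SIZE is a compile-time constant for any given
function, this really is just one and plus one compare at run time.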
*** setup_new_block

setup_new_block will be a small naked function emitted into all programs
being compiled with segmented stacks. It does the following:

1. Checks if the current block's next pointer is populated or not.

2. In case it isn't (i.e. it is NULL), it allocates a new block and sets
   the next pointer to point to the new block. It then sets the
   'previous' pointer of the new block to point to the old block.
   Otherwise, it directly goes to step 3.

3. Everything addressed relative to the stack pointer (in the function
   body) is copied onto the new block. This is so that the SP-relative
   loads and stores work correctly. For x86 systems, this includes the
   function arguments passed on the stack, a return address (not the
   original one; see below), and the previous stack pointer. I don't
   think we will need to copy the actual argument values in the case of
   a vararg function (we can't, in fact), since we'll be accessing the
   arguments using the fields in the va_list structure (and they point
   back to the previous block, where the arguments actually are). The
   return address placed on the new block is not the original return
   address placed on the old stack by the caller, but the address of
   another naked function (called cleanup_new_block) described below.

4. The stack pointer is set to (the address of the new block + 4 * word
   size).

5. Control is returned to the prologue, and execution continues
   normally. We cannot use a usual ret here, so we'll have to use one of
   the caller-saved registers to communicate the address of the next
   instruction to setup_new_block.

**** Frame pointers

In case the frame pointer is being maintained for the function, the
value at %rbp will also have to be copied to the new block, and %rbp
will have to be changed to point to this new location (so that the
FP-relative indexing works). %rbp will then be set to its "correct"
value by the epilogue itself. We'll probably need a new routine
(setup_new_block_fp) for this situation.

**** Stack space for setup_new_block

setup_new_block and setup_new_block_fp will be written to use the
minimum amount of stack space possible, and that stack space will have
to be factored into the check in the function prologue (so every
function prologue checks for the space it needs _plus_ the space
required to run setup_new_block once, by a callee).

*** cleanup_new_block

cleanup_new_block will be a simple operation, which restores the old SP
(saved_sp from the block header) and jumps to saved_ip (also from the
block header). Later, if we decide to free unused blocks, that cleanup
can be done here.

*** Eliminating allocations and checks

It is evident that setup_new_block is too expensive to run in every
prologue. In fact, it is probably worth the extra effort to eliminate
the check in as many situations as possible. I was pointed to a paper
earlier - 'Whole-Program Optimization for Time and Space Efficient
Threads' - which talks about link-time determination of upper bounds on
the stack space required by threads. It does this by creating a call
graph and using the same techniques used for analyzing a program's
control flow graph (dominators, back edges, et al.). So, if the function
A is dominated by the function B, and A needs 32 bytes of stack space
while B needs 64, we can check for 96 bytes in B's prologue and
eliminate the check from A altogether. Even otherwise, if leaf functions
F, G and H are reachable only via I and J, and F, G, H, I, J take f, g,
h, i, j bytes of stack space respectively, we should be able to check
for (MAX(f, g, h) + i) bytes in I and (MAX(f, g, h) + j) bytes in J, and
eliminate the checks in F, G and H altogether (see the sketch below). In
my opinion, this should lead to substantial savings, especially keeping
in mind the typical pattern of a module having a few publicly visible
functions which delegate the core functionality to a battery of smaller
functions with internal linkage and private visibility.
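The MAX-based reasoning generalises to a bottom-up walk over the call
graph. A sketch with invented names, assuming an acyclic graph
(recursion or indirect calls would force a dynamic check somewhere on
the cycle):

    #include <stddef.h>

    struct fn {
        size_t frame_size;     /* stack bytes this function needs */
        struct fn **callees;   /* NULL-terminated list of callees */
    };

    /* Worst-case stack needed from f downward: its own frame plus the
       maximum over its callees.  Checking for stack_bound(f) bytes in
       f's prologue lets every function below f skip its own check. */
    static size_t stack_bound(const struct fn *f)
    {
        size_t worst = 0;
        for (struct fn **c = f->callees; c && *c; c++) {
            size_t need = stack_bound(*c);
            if (need > worst)
                worst = need;
        }
        return f->frame_size + worst;
    }

With F, G and H as the callees of I, stack_bound(I) is exactly the
(MAX(f, g, h) + i) figure from the text.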
*** The red zone

The red zone can be used only if there actually _are_ 128 bytes of stack
space left in the current block. That will need to be factored into the
check (in X86FrameLowering::emitPrologue).

*** DWARF .eh_frame

(This is largely unchanged from the previous version I sent to the list;
I've repeated it here for completeness.)

DWARF info can be generated as follows: since we know the offset of the
base of the stack frame from the stack pointer (or we are maintaining a
frame pointer), we can always say whether the concerned frame is the
first frame in its block. If it is not, all the previous register values
can be computed as usual; if it is, we add an extra indirection, looking
up the stack pointer saved in this block's header.

*** Interoperability

Clearly, we can't safely make calls from code using segmented stacks
into code expecting a vanilla contiguous stack. This can be fixed by the
linker, by padding calls from segmented-stack code into
non-segmented-stack code with the allocation and de-allocation of a 64k
block. For the linker to be able to do this, the compiler needs to emit
a dummy section into object files compiled with segmented stacks, which
the linker will then use as a flag. This is exactly the approach
mentioned in [1]. However, this approach is suboptimal, and to get the
full benefits of segmented stacks, all code needs to be compiled with
segmented stacks.

[1] http://gcc.gnu.org/wiki/SplitStacks
--
Sanjoy Das
http://playingwithpointers.com
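For the red zone point above, the prologue check only changes by a
constant, as in this sketch (same assumed names and layout as the
earlier frame_fits sketch):

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE  4096u
    #define HEADER_SIZE (4u * sizeof(void *))
    #define RED_ZONE    128u   /* x86-64 SysV red zone */

    /* Same mask-and-compare as the plain prologue check, with the red
       zone folded into the (compile-time constant) requirement, so a
       leaf function may safely use the 128 bytes below sp. */
    static inline int frame_fits_with_red_zone(uintptr_t sp,
                                               size_t frame_size)
    {
        return (sp & (BLOCK_SIZE - 1)) >=
               frame_size + RED_ZONE + HEADER_SIZE;
    }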
I wonder - would something like this allow for multiple stacks for a
single thread? I'm thinking of something like continuations / fibers /
green threads, which would be very handy.

On Wed, Mar 23, 2011 at 3:07 AM, Sanjoy Das
<sanjoy at playingwithpointers.com> wrote:

> I intend to implement split (segmented) stacks for LLVM (like we have
> in Go, and as is being implemented for GCC [1]). [...]

--
-- Talin
On Apr 10, 2011, at 2:45 PM, Talin wrote:

> I wonder - would something like this allow for multiple stacks for a
> single thread? I'm thinking of something like continuations / fibers /
> green threads, which would be very handy.

I haven't looked at the proposal, but yes, this would be very useful
functionality for LLVM to provide.

-Chris
(Sorry for replying to the wrong post, but the original one had gone out
of scope when this occurred to me.)

> On Wed, Mar 23, 2011 at 3:07 AM, Sanjoy Das
> <sanjoy at playingwithpointers.com> wrote:
>
>> In the prologue, a function will check whether the current block has
>> enough stack space. This is easily done for functions which don't have
>> variable sized allocas, and for ones which do, we can assume some
>> worst-case upper bound.

What would you do if a function allocas in a loop? The programmer may
know that the loop won't execute, say, more than twice, but your code
doesn't, unless you're lucky. Or the size to be allocated may be passed
in as a parameter, from code that the compiler does not even see in this
run (think of dynamically linked code, or alloca sizes that depend on
user input).

So you'll need a fallback strategy for such situations. Either fall back
to the standard stack (but that would more or less defeat the purpose of
all this); or make the code generators deal with stacks that have
segmented per-function stack frames (whether that's even feasible, I
have no idea, but I guess it wouldn't be without a gruesome amount of
work, and this project would stay restricted to those backends that
happen to get maintained for this); or deal with such cases by
redirecting all allocas to the heap, with the possible option of moving
them back to the stack where the compiler can indeed infer the stack
requirement beforehand.

- - -

In general, I think it's a good idea to avoid overly large stack frames
anyway. Operating systems and similar infrastructure tend to assume that
the per-function stack requirement is limited to a small constant, and
breaking that assumption can drive them into less well-tested code
paths, triggering security holes or inefficiencies.

Just my 2c, and back to lurker mode.

Regards,
Jo
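A concrete illustration of the case described here (plain C; consume
and fill are invented names): the frame's final size depends on a
run-time value, so no prologue-time constant can bound it.

    #include <alloca.h>   /* glibc; on some systems alloca is in <stdlib.h> */

    void consume(char *buf);  /* assumed external sink */

    void fill(int n)
    {
        for (int i = 0; i < n; i++) {
            /* each iteration extends the current frame; the total,
               64 * n bytes, is unknown until run time */
            char *buf = alloca(64);
            consume(buf);
        }
    }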