Joshua Thomas Wise via llvm-dev
2020-Mar-27 18:51 UTC
[llvm-dev] Efficient Green Thread Context-Switching
Hi LLVM devs,

I'd like to describe my problem, and then propose new LLVM features that would solve it efficiently.

I'm building a new statically compiled programming language, and I plan on using LLVM as the backend. The language will have a runtime with cooperatively managed green threads (user-space "mini-threads", each with its own dynamically allocated stack). A single OS thread will be responsible for managing many (perhaps millions of) green threads. Essentially, the program flow is like this:

1. The program is executing green thread 1 (GT1).
2. The program reaches a "yield point", which does three things:
   a. Saves all live registers to the stack.
   b. Calls a runtime function, runtimeYield().
   c. Pops the saved registers from the stack, restoring the original CPU state (we'll get to this later).
3. runtimeYield() is called with a calling convention that stores the return address on top of the stack.
4. Inside runtimeYield(), we record the stack pointer (%rsp) in a queue maintained by the scheduler, which tracks all existing green threads.
5. Now we need to resume some other green thread (GT2). First we set the stack pointer (%rsp) to reference GT2's stack. Then we pop GT2's return address from the top of its stack and "jmp" to it, which jumps to where GT2 previously yielded.
6. As GT2 resumes execution, it is on the second half of a "yield point", so it restores its original CPU state by popping the saved registers from its stack.

This model has the potential to be extremely efficient (perhaps more efficient than any other existing green thread implementation), but it can only achieve that efficiency through compiler support. The model is borrowed from a research paper published in 2013: https://dl.acm.org/doi/10.1145/2400682.2400695
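For intuition, the scheduler side of the flow above (the queue of suspended threads in step 4, and resuming at the old suspension point in step 5) can be mimicked at a much coarser granularity with generators, where the language runtime rather than the compiler saves the suspended state. This is only an illustrative sketch with invented names, not the proposed mechanism:

```python
from collections import deque

def green_thread(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"      # cooperative yield point

def scheduler(threads):
    """Round-robin over green threads, resuming each at its last yield.

    The deque plays the role of the runtime's queue of saved stack
    pointers; next(t) plays the role of the jmp back to the point
    where the thread previously yielded.
    """
    ready = deque(threads)
    trace = []
    while ready:
        t = ready.popleft()
        try:
            trace.append(next(t))   # resume until the next yield point
            ready.append(t)         # still alive: requeue it
        except StopIteration:
            pass                    # green thread finished
    return trace

print(scheduler([green_thread("GT1", 2), green_thread("GT2", 1)]))
# → ['GT1:0', 'GT2:0', 'GT1:1']
```

The key difference is that here the Python runtime saves a whole frame object on every suspension, whereas the compiler-assisted model saves only the registers that are actually live at that particular yield point.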
The paper describes how, by designing the context switch as a compiler intrinsic instead of a runtime library function, many existing compiler optimizations can be applied automatically to user-space context switches. The paper presents many benchmarks, all of which show this model heavily outperforming existing models of context switching in both speed and memory usage. For example, it mentions that the POSIX functions makecontext() and swapcontext() save a structure that's 936 bytes in size (768 on my machine), while their model generates context switches that save only 8-80 bytes of CPU state, depending on which existing optimizations apply to the specific context switch. In some cases, the entire context switch is reduced to 1 or 2 instructions.

Conveniently, the authors of the paper actually implemented their model in a fork of LLVM: https://github.com/stedolan/llvm. The model is implemented with just two additions: a new calling convention called "swapstack" and a new intrinsic function called "newstack". They demonstrate that these two primitives can be used not just for user-level context switching, but also for implementing efficient coroutines and generators, all while taking advantage of existing LLVM optimizations. I encourage you to read the entire paper, as it explains the implementation of "swapstack" and "newstack" in great detail.

I also found a mailing list thread from 2010 in which someone else asked for the same functionality: http://lists.llvm.org/pipermail/llvm-dev/2010-April/030914.html

I propose that the LLVM devs consider officially adding "swapstack" and "newstack" to the LLVM project. Thank you for taking the time to read this proposal, and thank you for all your hard work on LLVM.
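As a rough analogy for the two primitives (not the paper's actual IR), one can think of "newstack" as creating a suspended context and "swapstack" as a control transfer that also passes a value and returns whatever the other context later passes back. A hedged sketch of that control-flow shape, with all names invented for illustration:

```python
class GreenStack:
    """Toy analogue of the paper's two primitives: the constructor
    plays the role of newstack (allocate a fresh, suspended context
    that will run fn); swap plays the role of a swapstack transfer
    (suspend the caller, resume this context with a value, and return
    whatever this context hands back at its next transfer)."""

    def __init__(self, fn):
        self._fn = fn
        self._gen = None   # context not started yet

    def swap(self, value):
        if self._gen is None:
            self._gen = self._fn(self)
            return next(self._gen)       # start running the new context
        return self._gen.send(value)     # resume it, passing a value

def counter(ctx):
    # A generator-style context: each `yield` transfers control (and a
    # value) back to whoever swapped in, exactly the coroutine/generator
    # use case the paper says these primitives support.
    total = 0
    while True:
        total += yield total

c = GreenStack(counter)
print(c.swap(None))  # → 0
print(c.swap(5))     # → 5
print(c.swap(2))     # → 7
```

In the paper's model the analogous transfer compiles down to a handful of instructions, because the compiler knows exactly which registers are live across each swap.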
Thanks,
- Joshua Wise
Joerg Sonnenberger via llvm-dev
2020-Mar-27 19:29 UTC
[llvm-dev] Efficient Green Thread Context-Switching
On Fri, Mar 27, 2020 at 01:51:48PM -0500, Joshua Thomas Wise via llvm-dev wrote:
> The paper shows the results of many benchmarks, all of which show this
> model heavily outperforming existing models of context switching, in
> both speed and memory usage. For example, it mentions that the POSIX
> functions makecontext() and swapcontext() save a structure that’s 936
> bytes in size (768 on my machine), but their model generates context
> switches that save only 8-80 bytes of CPU state, depending on which
> existing optimizations are able to apply to the specific context switch.
> In some cases, the entire context switch is reduced to 1 or 2 instructions.

Take a look at the setjmp/longjmp intrinsics. IMO it is dishonest to compare with makecontext/swapcontext, which generally have to save the full FPU state, and that's where the majority of the data is.

Joerg
Joshua Thomas Wise via llvm-dev
2020-Mar-27 19:58 UTC
[llvm-dev] Efficient Green Thread Context-Switching
Sorry, I certainly didn't mean to be dishonest. I was just repeating one of the comparisons given by the research paper. Regardless, even setjmp() uses a structure 148 bytes in size (on my machine). The main point is that with compiler support, many context switches can be reduced to just a few instructions and only 8 bytes of memory, and even in the worst case they will outperform setjmp() by a factor of 2. That's very significant for modern servers, which can expect millions of contexts to exist simultaneously. It would also really help small IoT processors multitask, which is something they're normally bad at. I think the use cases and the efficiency achieved are compelling (especially as computing becomes more and more parallel/concurrent).

> On Mar 27, 2020, at 2:29 PM, Joerg Sonnenberger via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Take a look at the setjmp/longjmp intrinsics. IMO it is dishonest to
> compare with makecontext/swapcontext, which generally have to save the
> full FPU state and that's where the majority of the data is.
>
> Joerg