Joshua Thomas Wise via llvm-dev
2020-Mar-27 18:51 UTC
[llvm-dev] Efficient Green Thread Context-Switching
Hi LLVM devs,

I'd like to describe my problem, and then propose new LLVM features that would solve it efficiently.

I'm building a new statically compiled programming language, and I plan on using LLVM as the backend. The language will have a runtime with cooperatively managed green threads (user-space "mini-threads", each with its own dynamically allocated stack). A single OS thread will be responsible for managing many (perhaps millions of) green threads. Essentially, the program flow is like this:

1. The program is executing green thread 1 (GT1).
2. The program reaches a "yield point", which does three things:
   a. Saves all live registers to the stack.
   b. Calls a runtime function, runtimeYield().
   c. Pops the saved registers from the stack, restoring the original CPU state (we'll get to this later).
3. runtimeYield() is called with a calling convention that stores the return address on top of the stack.
4. Inside runtimeYield(), we record the stack pointer (%rsp) in a queue maintained by the scheduler, which tracks all existing green threads.
5. Now we need to resume some other green thread (GT2). First we set the stack pointer (%rsp) to reference GT2's stack. Then we pop GT2's return address from the top of its stack and "jmp" to it, which jumps to where GT2 previously yielded.
6. As GT2 resumes execution, it is on the second half of a "yield point", so it restores its original CPU state by popping the saved registers from its stack.

This model has the potential to be extremely efficient (perhaps more efficient than any other existing green thread implementation), but it can only achieve that efficiency through compiler support. The model is borrowed from a research paper published in 2013: https://dl.acm.org/doi/10.1145/2400682.2400695
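For intuition, the scheduler side of the flow above (the queue of suspended threads in step 4, and resuming at the old suspension point in step 5) can be mimicked at a much coarser granularity with generators, where the language runtime rather than the compiler saves the suspended state. This is only an illustrative sketch with invented names, not the proposed mechanism:

```python
from collections import deque

def green_thread(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"      # cooperative yield point

def scheduler(threads):
    """Round-robin over green threads, resuming each at its last yield.

    The deque plays the role of the runtime's queue of saved stack
    pointers; next(t) plays the role of the jmp back to the point
    where the thread previously yielded.
    """
    ready = deque(threads)
    trace = []
    while ready:
        t = ready.popleft()
        try:
            trace.append(next(t))   # resume until the next yield point
            ready.append(t)         # still alive: requeue it
        except StopIteration:
            pass                    # green thread finished
    return trace

print(scheduler([green_thread("GT1", 2), green_thread("GT2", 1)]))
# → ['GT1:0', 'GT2:0', 'GT1:1']
```

The key difference is that here the Python runtime saves a whole frame object on every suspension, whereas the compiler-assisted model saves only the registers that are actually live at that particular yield point.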
The paper describes how, by designing the context switch as a compiler intrinsic instead of a runtime library function, many existing compiler optimizations can be applied automatically to user-space context switches. The paper presents many benchmarks, all of which show this model heavily outperforming existing models of context switching in both speed and memory usage. For example, it mentions that the POSIX functions makecontext() and swapcontext() save a structure that's 936 bytes in size (768 on my machine), while their model generates context switches that save only 8-80 bytes of CPU state, depending on which existing optimizations apply to the specific context switch. In some cases, the entire context switch is reduced to 1 or 2 instructions.

Conveniently, the authors of the paper actually implemented their model in a fork of LLVM: https://github.com/stedolan/llvm. The model is implemented with just two additions: a new calling convention called "swapstack" and a new intrinsic function called "newstack". They demonstrate that these two primitives can be used not just for user-level context switching, but also for implementing efficient coroutines and generators, all while taking advantage of existing LLVM optimizations. I encourage you to read the entire paper, as it explains the implementation of "swapstack" and "newstack" in great detail.

I also found a mailing list thread from 2010 in which someone else asked for the same functionality: http://lists.llvm.org/pipermail/llvm-dev/2010-April/030914.html

I propose that the LLVM devs consider officially adding "swapstack" and "newstack" to the LLVM project. Thank you for taking the time to read this proposal, and thank you for all your hard work on LLVM.
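As a rough analogy for the two primitives (not the paper's actual IR), one can think of "newstack" as creating a suspended context and "swapstack" as a control transfer that also passes a value and returns whatever the other context later passes back. A hedged sketch of that control-flow shape, with all names invented for illustration:

```python
class GreenStack:
    """Toy analogue of the paper's two primitives: the constructor
    plays the role of newstack (allocate a fresh, suspended context
    that will run fn); swap plays the role of a swapstack transfer
    (suspend the caller, resume this context with a value, and return
    whatever this context hands back at its next transfer)."""

    def __init__(self, fn):
        self._fn = fn
        self._gen = None   # context not started yet

    def swap(self, value):
        if self._gen is None:
            self._gen = self._fn(self)
            return next(self._gen)       # start running the new context
        return self._gen.send(value)     # resume it, passing a value

def counter(ctx):
    # A generator-style context: each `yield` transfers control (and a
    # value) back to whoever swapped in, exactly the coroutine/generator
    # use case the paper says these primitives support.
    total = 0
    while True:
        total += yield total

c = GreenStack(counter)
print(c.swap(None))  # → 0
print(c.swap(5))     # → 5
print(c.swap(2))     # → 7
```

In the paper's model the analogous transfer compiles down to a handful of instructions, because the compiler knows exactly which registers are live across each swap.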
Thanks,
- Joshua Wise
Joerg Sonnenberger via llvm-dev
2020-Mar-27 19:29 UTC
[llvm-dev] Efficient Green Thread Context-Switching
On Fri, Mar 27, 2020 at 01:51:48PM -0500, Joshua Thomas Wise via llvm-dev wrote:
> The paper shows the results of many benchmarks, all of which show this
> model heavily outperforming existing models of context switching, in
> both speed and memory usage. For example, it mentions that the POSIX
> functions makecontext() and swapcontext() save a structure that’s 936
> bytes in size (768 on my machine), but their model generates context
> switches that save only 8-80 bytes of CPU state, depending on which
> existing optimizations are able to apply to the specific context switch.
> In some cases, the entire context switch is reduced to 1 or 2 instructions.

Take a look at the setjmp/longjmp intrinsics. IMO it is dishonest to compare with makecontext/swapcontext, which generally have to save the full FPU state, and that's where the majority of the data is.

Joerg
Joshua Thomas Wise via llvm-dev
2020-Mar-27 19:58 UTC
[llvm-dev] Efficient Green Thread Context-Switching
Sorry, I certainly didn't mean to be dishonest. I was just repeating one of the comparisons given by the research paper. Regardless, even setjmp() uses a structure 148 bytes in size (on my machine). The main point is that with compiler support, many context switches can be reduced to just a few instructions and only 8 bytes of memory, and even in the worst case they will outperform setjmp() by a factor of 2. That's very significant for modern servers, which can expect millions of contexts to exist simultaneously. It would also really help small IoT processors multitask, which is something they're normally bad at. I think the use cases and the efficiency achieved are compelling (especially as computing becomes more and more parallel/concurrent).

> On Mar 27, 2020, at 2:29 PM, Joerg Sonnenberger via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Take a look at the setjmp/longjmp intrinsics. IMO it is dishonest to
> compare with makecontext/swapcontext, which generally have to save the
> full FPU state and that's where the majority of the data is.
>
> Joerg