Hi all: Below is a proposal to add experimental coroutine support to LLVM. Though this proposal is motivated primarily by the desire to support C++ Coroutines [1], the llvm representation is language neutral and can be used to support coroutines in other languages as well. Clang + llvm coroutines allows you to take this code: generator<int> range(int from, int to) { for(int i = from; i < to; ++n) co_yield i; } int main() { int sum = 0; for (auto v: range(1,100)) sum += v; return sum; } And translate it down to this: define i32 @main() #5 { entry: ret i32 4950 } I prototyped llvm changes to support this proposal and extended clang coroutine implementation [2] to emit proposed intrinsics to smoke test the proposed design and to get simple generator and async task style coroutines working. See "More Attention Required" in the doc/Coroutines.rst for details on what addition work is required beyond cleanup and bug fixes. I would like to start the discussion on the overall design, representation and a roadmap to start upstreaming the changes with the intention to continue the development in-tree incrementally. Roadmap: -------- 1) Get agreement on coroutine representation and overall direction (this mail). .. repeat 1) until happy 2) Get agreement on how coroutine transformation passes integrate into the optimizer pipeline. .. repeat 2) until happy 3) update IR/Intrinsics.td + doc/Coroutines.rst + doc/index.rst 4) trickle in coroutine transformation passes + tests in digestible chunks. 5) get clang changes in (sometime after step 3). 6) fix bugs / remove limitations / add functionality to coroutine passes .. repeat 6) until happy Background =========Coroutine representation is the result of several offline discussions with Richard Smith, Reid Kleckner, and David Majnemer. All of the good came from the collective effort. All the yuckiness was sneaked in by me after those discussions. I'll start with a quick overview. LLVM coroutines are functions that have one or more `suspend points`. When a suspend point is reached, the execution of a coroutine is suspended and control is returned back to its caller. A suspended coroutine can be resumed to continue execution from the last suspend point or be destroyed. In the following example, function `f` (which may or may not be a coroutine itself) returns a handle to a suspended coroutine (**coroutine handle**) that is used by `main` to resume the coroutine twice and then destroy it: .. code-block:: llvm define i32 @main() { entry: %hdl = call i8* @f(i32 4) call void @llvm.experimental.coro.resume(i8* %hdl) call void @llvm.experimental.coro.resume(i8* %hdl) call void @llvm.experimental.coro.destroy(i8* %hdl) ret i32 0 } In addition to the function stack frame which exists when a coroutine is executing, there is an additional region of storage that contains values that keep the coroutine state when a coroutine is suspended. This region of storage is called **coroutine frame**. It is created when a coroutine is called. It is destroyed when a coroutine runs to completion or destroyed by a call to the `coro.destroy`_ intrinsic. Unlike stackful coroutines [3] which maintains a stack per coroutine, an llvm coroutine frame only contains the values that need to persist across suspend points (there is a path from a use to the definition that crosses a suspend point). Overall idea is that a coroutine is presented to an llvm as an ordinary function with suspension points marked up with intrinsics. We let the optimizer party on the coroutine for awhile. Shortly before the coroutine is eligible to be inlined into its callers, we outline parts of the coroutine that correspond to the code that needs to get executed after the coroutine is resumed or destroyed. The coroutine resumption intrinsics get replaced with direct calls to those outlined functions where possible. Then inliner can inline much leaner and nicer coroutine or any of the outlined parts as desired. If we discover that the lifetime of a coroutine is fully enclosed in the lifetime of the caller, we remove dynamic allocation for coroutine frame and replace it with an `alloca` on the caller's frame. Please see doc/Coroutines.rst for more details: doc/Coroutines.rst: http://reviews.llvm.org/D21170 IR/Intrinsics.td: http://reviews.llvm.org/D21169 Concerns: ========(This section assumes that you looked at doc/Coroutins.rst and IR/Intrinsics.td) Invoke like intrinsics via call + conditional branch: ----------------------------------------------------- The `llvm.experimental.coro.fork` and `llvm.experimental.coro.suspend` intrinsics model behavior somewhat similar to the invoke instruction. Namely, they are used to show a normal continuation branch and an alternative branch. They are used in the following pattern: %where.to = call i1 @llvm.experimental.coro.XXXX() br i1 %where.to, label %normal, label %alternative My concern is that this is somewhat fragile. After optimizations some code can sneak in between the intrinsic and the branch and cause trouble. I am looking for suggestions on how to make it more robust. One thought I had was to put some intrinsics at the beginning of the %normal and %alternative blocks so as to prevent code motion above them. Something like: ... %where.to = call i1 @llvm.experimental.coro.XXXX() br i1 %where.to, label %normal, label %alternative normal: <phi> call i1 @llvm.experimental.coro.normal(); ... alternative: <phi> call i1 @llvm.experimental.coro.alt(); ... After coro.xxx is lowered, coro.normal and coro.alt are removed. Looking up outlined coroutine parts ----------------------------------- Last parameter of the `llvm.experimental.coro.init" intrinsic is a pointer to a coroutine function. To get to the outlined parts, I get the name of that function and glue ".resume", ".destroy" or ".cleanup" suffixes to it and then look them up by name. Probably a better alternative is for a coroutine f to create a constant containing an array of pointers to f.resume, f.destroy and other parts of the coroutine and use the last parameter to refer to that constant. Other questions --------------- I have more questions about the best way of plugging in coroutine optimization passes into the PassManager, but, I will defer those for later discussion. All the feedback and comments is greatly appreciated!!! Thank you, Gor References: ========== [1] http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0057r4.pdf [2] http://llvmweekly.org/issue/95 (Coroutine support added to clang) [3] http://www.boost.org/doc/libs/1_61_0/libs/coroutine/doc/html/index.html Review Links: ============ doc/Coroutines.rst: http://reviews.llvm.org/D21170 IR/Intrinsics.td: http://reviews.llvm.org/D21169
On 9 Jun 2016, at 06:57, Gor Nishanov via llvm-dev <llvm-dev at lists.llvm.org> wrote:> > Though this > proposal is motivated primarily by the desire to support C++ Coroutines [1], > the llvm representation is language neutral and can be used to support > coroutines in other languages as well.I’m sorry if this is off-topic, but has anyone in WG21 yet come up with a sane proposal about how coroutines interact with thread-local storage? In LLVM-compiled code, it is possible that the address of a thread-local variable will be cached to avoid repeated lookups. In an environment that contains both coroutines and TLS, it is possible that the coroutine will be migrated to a different thread between suspension and resumption. Does this mean that, for correctness, we must regard these suspension points as barriers that prevent loads or stores of thread-local variables and prevent caching of loads of addresses of thread-local variables? Do we have to treat every function call as a suspension point, as no function can know whether it is executing in a coroutine or whether a function that it calls may trigger a thread migration? This sounds as if it would cause noticeable performance regressions in code that makes heavy use of thread-local variables. David
On Wed, Jun 8, 2016 at 10:57 PM, Gor Nishanov via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Hi all: > > Below is a proposal to add experimental coroutine support to LLVM. Though > this > proposal is motivated primarily by the desire to support C++ Coroutines > [1], > the llvm representation is language neutral and can be used to support > coroutines in other languages as well. >I looked over the proposal... a few comments: 1. This adds a lot of intrinsics, and they have weird semantics. This is bad in general; the more weirdness you add to the IR, the more you'll struggle to make the optimizer actually understand what you're doing. And magic pattern matching *will* break. You've already pointed out the issue with branches. (It's actually worse than you think: there isn't any guarantee there will even be a branch when you've thrown the optimizer at it.) The save intrinsic says "Its return value should be consumed by exactly one `coro.suspend` intrinsic.", but it might have zero uses by the time your lowering pass runs. You could end up with code getting inserted before llvm.experimental.coro.fork. 2. If you need some sort of unusual control flow construct, make it a proper terminator instruction; don't try to glue your intrinsic to a normal branch instruction. A new kind of terminator instruction might be appropriate. Or maybe you can make "invoke" work for you. 3. Maybe you could reduce the amount of magic involved by splitting the initialization into a separate function? So the initialization is always just something like "f(int x) { frame = init(); f_inner(x, frame); return frame; }", and you don't have to deal with the weirdness of fork(). -Eli -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160609/5794ab5f/attachment.html>
Hi David:>> has anyone in WG21 yet come up with a sane proposal about how coroutines interact with thread-local storage?There is a discussion in SG1 subgroup (Concurrency and Parallelism) on the meaning of TLS in the lightweight thread of execution. I don't think there is a resolution yet. Whether to think of a coroutine of this kind as a thread of execution or just an "unusual" function is not decided either.>> Does this mean that, for correctness, we must regard these suspension points as barriers that prevent loads or stores of thread-local variables and prevent caching of loads of addresses of thread-local variables?You are right! P0057 implies (probably need to make a note about it) that when a coroutine is executing, you get thread_local data for the current thread. For your first point (caching TLS loads), I think I am getting it for free. coro.suspend intrinsic, from llvm standpoint, may read and write any memory it can get access to, therefore, if a value was loaded from TLS before coro.suspend, it would have to be reloaded after a call to coro.suspend. With respect to caching of addresses of TLS variables. 1) if a user does the caching, we should let him: thread_local int a; task f() { auto x = &a; // captures the address of a in Thread 1 co_await some_async_call(); *x = *x + 1; // updates the value in Thread1, whatever the current thread is } 2) If llvm is does it, than, absolutely, we should prevent "prevent caching of loads of addresses across suspend points". I poked around a little and was not able to come up with the case where it would, but, I'll keep looking. If you know of such a case, let me know. Coroutines are split into normal functions in the middle optimizer, so if there is any caching done during lowering, coroutines won't be affected.>> Do we have to treat every function call as a suspension point, as no function can know whether it is executing in >> a coroutine or whether a function that it calls may trigger a thread migration? This sounds as if >> it would cause noticeable performance regressions in code that makes heavy use of thread-local variables.That should not be necessary. The proposed coroutines are "stackless", they can only get suspended when control is in the body of the coroutine at explicitly marked suspend points. Thus thread migration can only happen at suspend point boundaries. With stackful coroutines (aka Fibers, Green Threads, User-Mode threads, etc), it is a big concern, since none of the functions that runs on a fiber are aware that they run on a fiber and potentially can get a thread switch at every call. Yep. TLS is pretty much busted there. Thank you very much for your comments!!! On Thu, Jun 9, 2016 at 1:25 AM, David Chisnall <David.Chisnall at cl.cam.ac.uk> wrote:> On 9 Jun 2016, at 06:57, Gor Nishanov via llvm-dev <llvm-dev at lists.llvm.org> wrote: >> >> Though this >> proposal is motivated primarily by the desire to support C++ Coroutines [1], >> the llvm representation is language neutral and can be used to support >> coroutines in other languages as well. > > I’m sorry if this is off-topic, but has anyone in WG21 yet come up with a sane proposal about how coroutines interact with thread-local storage? In LLVM-compiled code, it is possible that the address of a thread-local variable will be cached to avoid repeated lookups. In an environment that contains both coroutines and TLS, it is possible that the coroutine will be migrated to a different thread between suspension and resumption. Does this mean that, for correctness, we must regard these suspension points as barriers that prevent loads or stores of thread-local variables and prevent caching of loads of addresses of thread-local variables? Do we have to treat every function call as a suspension point, as no function can know whether it is executing in a coroutine or whether a function that it calls may trigger a thread migration? This sounds as if it would cause noticeable performance regressions in code that makes heavy use of thread-local variables. > > David >
Hi Eli: Thank you very much for your comments!>> If you need some sort of unusual control flow construct, make it a >> proper terminator instructionI would love to. I was going by the advice in "docs/ExtendingLLVM.rst": "WARNING: Adding instructions changes the bitcode format, and it will take some effort to maintain compatibility with the previous version. Only add an instruction if it is absolutely necessary. Having coro.fork and coro.suspend to be proper terminators (as they are) will be much cleaner. Something like: corobegin to label %start suspend label %retblock corosuspend [final] [save %token] resume label %resume cleanup label %cleanup I did consider "repurposing" 'invoke', but felt it is too much of an abuse. If there is a support for making corobegin/corosuspend instructions, I would love to do it. Also, if there is a support for invoke abuse :-), I will be glad to try it out and come back with the report on how it went.>> Maybe you could reduce the amount of magic involved by splitting the >> initialization into a separate function? So the initialization is always >> just something like "f(int x) { frame = init(); f_inner(x, frame); return >> frame; }", and you don't have to deal with the weirdness of fork().I think outlining the "start" part will be sufficient (starting at the coro.fork and cutting all of the branches ending in coro.end). In this case llvm won't see coro.fork and coro.end as the clang will do the outlining work. Something like T f(int x, A y) { auto mem = coro.elide(); if (!mem) mem = malloc(llvm.frame.size(&f_inner)); auto hdl = coro.init(mem); T result = <whatever>; f_inner(hdl, x, &y); return result; } f will be a normal function and f_inner will be magic function that will require to be split at suspend points. It will be processed in the IPO order, so by the time f_inner is considered for inlining into f, it will be a normal function. I think this can work, but we are trading off one complexity for another. As presented: coroutine is defined as a magic function (that has weird intrinsics) and requires splitting during IPO With coro.fork / coro.end removed coroutine is defined as a combination of normal function + magic function that requires splitting during IPO. If there are other languages that acquire coroutines they would need to have the same outlining pass as clang. I need to think about it more, but, I am currently leaning toward the first option (with coro.fork and coro.end). The "f_inner" is still part of the f, but nicely delineated with corobegin and coro.end. corobegin to label %coro.start suspend label %retblock corosuspend [final] [save %token] resume label %resume cleanup label %cleanup call void @llvm.coro.end(); Does it look better? Gor On Thu, Jun 9, 2016 at 1:33 AM, Eli Friedman <eli.friedman at gmail.com> wrote:> On Wed, Jun 8, 2016 at 10:57 PM, Gor Nishanov via llvm-dev > <llvm-dev at lists.llvm.org> wrote: >> >> Hi all: >> >> Below is a proposal to add experimental coroutine support to LLVM. Though >> this >> proposal is motivated primarily by the desire to support C++ Coroutines >> [1], >> the llvm representation is language neutral and can be used to support >> coroutines in other languages as well. > > > I looked over the proposal... a few comments: > > 1. This adds a lot of intrinsics, and they have weird semantics. This is > bad in general; the more weirdness you add to the IR, the more you'll > struggle to make the optimizer actually understand what you're doing. And > magic pattern matching *will* break. You've already pointed out the issue > with branches. (It's actually worse than you think: there isn't any > guarantee there will even be a branch when you've thrown the optimizer at > it.) The save intrinsic says "Its return value should be consumed by > exactly one `coro.suspend` intrinsic.", but it might have zero uses by the > time your lowering pass runs. You could end up with code getting inserted > before llvm.experimental.coro.fork. > 2. If you need some sort of unusual control flow construct, make it a proper > terminator instruction; don't try to glue your intrinsic to a normal branch > instruction. A new kind of terminator instruction might be appropriate. Or > maybe you can make "invoke" work for you. > 3. Maybe you could reduce the amount of magic involved by splitting the > initialization into a separate function? So the initialization is always > just something like "f(int x) { frame = init(); f_inner(x, frame); return > frame; }", and you don't have to deal with the weirdness of fork(). > > -Eli
Sorry for the novice question, but how are coroutines lowered in the backend? It requires making operating system calls to something like pthreads right? On Wed, Jun 8, 2016 at 11:57 PM, Gor Nishanov via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Hi all: > > Below is a proposal to add experimental coroutine support to LLVM. Though > this > proposal is motivated primarily by the desire to support C++ Coroutines > [1], > the llvm representation is language neutral and can be used to support > coroutines in other languages as well. > > Clang + llvm coroutines allows you to take this code: > > generator<int> range(int from, int to) { > for(int i = from; i < to; ++n) > co_yield i; > } > int main() { > int sum = 0; > for (auto v: range(1,100)) > sum += v; > return sum; > } > > And translate it down to this: > > define i32 @main() #5 { > entry: > ret i32 4950 > } > > I prototyped llvm changes to support this proposal and extended clang > coroutine > implementation [2] to emit proposed intrinsics to smoke test the proposed > design and to get simple generator and async task style coroutines working. > See "More Attention Required" in the doc/Coroutines.rst for details on what > addition work is required beyond cleanup and bug fixes. > > I would like to start the discussion on the overall design, representation > and > a roadmap to start upstreaming the changes with the intention to continue > the > development in-tree incrementally. > > Roadmap: > -------- > 1) Get agreement on coroutine representation and overall direction (this > mail). > .. repeat 1) until happy > 2) Get agreement on how coroutine transformation passes integrate into the > optimizer pipeline. > .. repeat 2) until happy > 3) update IR/Intrinsics.td + doc/Coroutines.rst + doc/index.rst > 4) trickle in coroutine transformation passes + tests in digestible chunks. > 5) get clang changes in (sometime after step 3). > 6) fix bugs / remove limitations / add functionality to coroutine passes > .. repeat 6) until happy > > Background > =========> Coroutine representation is the result of several offline discussions > with Richard Smith, > Reid Kleckner, and David Majnemer. All of the good came from the > collective effort. > All the yuckiness was sneaked in by me after those discussions. > > I'll start with a quick overview. > > LLVM coroutines are functions that have one or more `suspend points`. > When a suspend point is reached, the execution of a coroutine is suspended > and > control is returned back to its caller. A suspended coroutine can be > resumed > to continue execution from the last suspend point or be destroyed. > > In the following example, function `f` (which may or may not be a > coroutine itself) > returns a handle to a suspended coroutine (**coroutine handle**) that is > used > by `main` to resume the coroutine twice and then destroy it: > > .. code-block:: llvm > > define i32 @main() { > entry: > %hdl = call i8* @f(i32 4) > call void @llvm.experimental.coro.resume(i8* %hdl) > call void @llvm.experimental.coro.resume(i8* %hdl) > call void @llvm.experimental.coro.destroy(i8* %hdl) > ret i32 0 > } > > In addition to the function stack frame which exists when a coroutine > is executing, > there is an additional region of storage that contains values that keep the > coroutine state when a coroutine is suspended. This region of storage > is called **coroutine frame**. It is created when a coroutine is called. > It is > destroyed when a coroutine runs to completion or destroyed by a call to > the `coro.destroy`_ intrinsic. Unlike stackful coroutines [3] which > maintains a > stack per coroutine, an llvm coroutine frame only contains the values that > need > to persist across suspend points (there is a path from a use to the > definition > that crosses a suspend point). > > Overall idea is that a coroutine is presented to an llvm as an ordinary > function > with suspension points marked up with intrinsics. We let the optimizer > party > on the coroutine for awhile. Shortly before the coroutine is eligible to be > inlined into its callers, we outline parts of the coroutine that > correspond to > the code that needs to get executed after the coroutine is resumed or > destroyed. > The coroutine resumption intrinsics get replaced with direct calls to those > outlined functions where possible. Then inliner can inline much leaner and > nicer > coroutine or any of the outlined parts as desired. > > If we discover that the lifetime of a coroutine is fully enclosed in > the lifetime > of the caller, we remove dynamic allocation for coroutine frame and > replace it > with an `alloca` on the caller's frame. > > Please see doc/Coroutines.rst for more details: > > doc/Coroutines.rst: http://reviews.llvm.org/D21170 > IR/Intrinsics.td: http://reviews.llvm.org/D21169 > > Concerns: > ========> (This section assumes that you looked at doc/Coroutins.rst and > IR/Intrinsics.td) > > Invoke like intrinsics via call + conditional branch: > ----------------------------------------------------- > > The `llvm.experimental.coro.fork` and `llvm.experimental.coro.suspend` > intrinsics model behavior somewhat similar to the invoke instruction. > Namely, > they are used to show a normal continuation branch and an alternative > branch. > > They are used in the following pattern: > > %where.to = call i1 @llvm.experimental.coro.XXXX() > br i1 %where.to, label %normal, label %alternative > > My concern is that this is somewhat fragile. After optimizations some code > can > sneak in between the intrinsic and the branch and cause trouble. I am > looking > for suggestions on how to make it more robust. One thought I had was to > put some > intrinsics at the beginning of the %normal and %alternative blocks so as to > prevent code motion above them. Something like: > > ... > %where.to = call i1 @llvm.experimental.coro.XXXX() > br i1 %where.to, label %normal, label %alternative > normal: > <phi> > call i1 @llvm.experimental.coro.normal(); > ... > alternative: > <phi> > call i1 @llvm.experimental.coro.alt(); > ... > > After coro.xxx is lowered, coro.normal and coro.alt are removed. > > Looking up outlined coroutine parts > ----------------------------------- > > Last parameter of the `llvm.experimental.coro.init" intrinsic is a pointer > to > a coroutine function. To get to the outlined parts, I get the name of that > function and glue ".resume", ".destroy" or ".cleanup" suffixes to it and > then > look them up by name. > > Probably a better alternative is for a coroutine f to create a constant > containing an array of pointers to f.resume, f.destroy and other parts of > the > coroutine and use the last parameter to refer to that constant. > > Other questions > --------------- > > I have more questions about the best way of plugging in coroutine > optimization > passes into the PassManager, but, I will defer those for later discussion. > > All the feedback and comments is greatly appreciated!!! > > Thank you, > Gor > > References: > ==========> > [1] http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0057r4.pdf > [2] http://llvmweekly.org/issue/95 (Coroutine support added to clang) > [3] > http://www.boost.org/doc/libs/1_61_0/libs/coroutine/doc/html/index.html > > Review Links: > ============> > doc/Coroutines.rst: http://reviews.llvm.org/D21170 > IR/Intrinsics.td: http://reviews.llvm.org/D21169 > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160610/e1975b50/attachment.html>
Hi Josh:> Sorry for the novice question, but how are coroutines lowered in the > backend? It requires making operating system calls to something like > pthreads right?No OS, no target specific code required. Just simple rewriting of a function during the middle optimizer. One of the first examples in the RFC shows that a coroutine is optimized down to a constant. Try doing that with pthreads :-)>> generator<int> range(int from, int to) { >> for(int i = from; i < to; ++n) >> co_yield i; >> } >> int main() { >> int sum = 0; >> for (auto v: range(1,100)) >> sum += v; >> return sum; >> } >> >> And translate it down to this: >> >> define i32 @main() #5 { >> entry: >> ret i32 4950 >> }Very briefly: LLVM coroutines are functions that have one or more `suspend points`_. When a suspend point is reached, the execution of a coroutine is suspended and control is returned back to its caller. A suspended coroutine can be resumed to continue execution from the last suspend point or be destroyed. Let's say you get coroutine that looks something like this void *f(int n) { for(;;) { bar(n++); <suspend> // magic: returns a coroutine handle on first suspend } } We will split it up into start, resume and destroy functions and will create a coroutine frame that will keep values that need to be live across suspension points. struct f.frame { int n; } void* f(int n) { auto s = new f.frame{n}; bar(s->n++); return s; } void f.resume(f.frame* s) { bar(s->n++); } void f.destroy(f.frame* s) { delete s; } You can read more details in the doc/Coroutines.rst up for review at http://reviews.llvm.org/D21170 or in the RFC that started this thread. Cheers Gor