On Fri, Jun 10, 2016 at 5:25 PM, Gor Nishanov <gornishanov at gmail.com> wrote:

> Hi Eli:
>
> >> Naively, you would expect that it would be legal to hoist the store...
> >> but that breaks your coroutine semantics because the global could be
> >> mutated between the first return and the resume.
>
> Hmmm... I don't see the problem. I think hoisting the store is a perfectly
> legal transformation with all semantics preserved.
>
> Let's look at your example:
>
> >> block1:
> >>   suspend
> >>   switch (resume, destroy, return)
> >>
> >> resume:
> >>   store zero to global @g
> >>   doA()
> >>   [...]
> >>
> >> destroy:
> >>   store zero to global @g
> >>   doB()
> >>   [...]
> >>
> >> return:
> >>   store zero to global @g
> >>   doC()
> >>   [...]
>
> As written, the behavior is:
>
> 1) When we encounter a suspend during the first pass through the function:
>    store 0 to @g and doC().
>
> 2) When we encounter a suspend after the coroutine was resumed:
>    ret void.
>
> 3) When the coroutine is resumed:
>    store 0 to @g and doA().
>
> 4) When the coroutine is destroyed:
>    store 0 to @g and doB().
>
> If this is a coroutine that can be resumed asynchronously from a different
> thread, there is a data race. For example, if resume happens before 'f'
> returns, doA() can write something useful to @g, and 'f' will clobber it
> with zero. But this race is already present in the code and is not
> introduced by LLVM.
>
> Let's hoist the store and see what happens:
>
> >> block1:
> >>   suspend
> >>   store zero to global @g
> >>   switch (resume, destroy, return)
> >>
> >> resume:
> >>   doA()
> >>   [...]
> >>
> >> destroy:
> >>   doB()
> >>   [...]
> >>
> >> return:
> >>   doC()
> >>   [...]
>
> Now, let's split it up:
> 1. RAUW coro.suspend with -1, 0, 1 in f, f.resume and f.destroy.
> 2. Remove coro.suspend in f; replace it with ret void in f.resume.
>
> void f() {
>   [...]
>   store zero to global @g
>   doC();
>   [...]
> }
>
> void @f.resume() {
> entry:
>   store zero to global @g
>   doA();
>   [....]
> }
>
> void @f.destroy() {
> entry:
>   store zero to global @g
>   doB();
>   [....]
> }
>
> Behavior looks exactly as before. What am I missing?

Err... I guess nothing; sorry, I got myself confused. Essentially executing
a return statement, then jumping back, seems wrong, but I'm having trouble
coming up with a good example of how it would actually break something. I
guess there aren't really any issues on a pure control-flow basis: you can't
execute a global side-effect a different number of times than it would
otherwise happen.

You might run into problems with allocas: LLVM's optimizer will assume the
lifetime of any alloca in the current function ends when you hit a "ret"
instruction, so jumping back from the ret to the middle of the function
could lead to wrong results. Silly example:

x = alloca...

block1:
  suspend
  ; increment could get moved here.
  switch (resume, destroy, return)

resume:
  x += 1
  [...]

destroy:
  x += 1
  [...]

return:
  (don't touch x)
  [...]

The increment is only supposed to execute once, but instead executes twice.

This isn't a great example... and I'm not sure issues are limited to
allocas... but that's the idea, anyway.

-Eli
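A rough C++ analogy of the alloca-lifetime hazard Eli describes - a minimal
sketch with illustrative names (Frame, f_broken, f_fixed), not code from
either proposal: a local whose lifetime ends when the function returns
cannot be used by a resume path that runs afterwards, which is one reason a
value that lives across a suspend point has to move into a heap-allocated
coroutine frame rather than stay a plain alloca.

#include <memory>
#include <thread>

struct Frame { int x = 0; };   // stand-in for a coroutine frame

void f_broken() {
  int x = 0;                                // plain local ("alloca"): dies at return
  std::thread([&x] { x += 1; }).detach();   // the "resume" may run after f_broken
}                                           // has returned: uses a dead stack slot (UB)

void f_fixed() {
  auto frame = std::make_shared<Frame>();            // outlives the return
  std::thread([frame] { frame->x += 1; }).detach();  // resume touches the frame instead
}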
Hi Eli:

>> Err... I guess nothing; sorry, I got myself confused.

That is my fault :-). I confused myself in the beginning, where I was trying
to treat the coro.suspend + cond.branch pair as a terminator and was looking
for branches. You showed me the way... Just ignore the branches and
represent the control flow cleanly, and SimplifyCFG will take care of the
rest :-).

>> You might run into problems with allocas:

>> x = alloca...
>>
>> block1:
>>   suspend
>>   ; increment could get moved here.
>>   switch (resume, destroy, return)
>>
>> resume:
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> destroy:
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> return:
>>   (don't touch x)
>>   [...]

Yep. You are right. We need some kind of barrier to prevent the motion.

Okay. New intrinsic:

  declare void coro.barrier(i64); // will get a unique number for every insertion

>> x = alloca...
>>
>> block1:
>>   suspend
>>   ; nothing can sneak in here???, right?
>>   switch (resume, destroy, return)
>>
>> resume:
>>   coro.barrier(0)
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> destroy:
>>   coro.barrier(1)
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> return:
>>   coro.barrier(2)
>>   (don't touch x)
>>   [...]

Now we are good, right?
I am still not giving up hope for an intrinsic approach.

Thanks a lot, Eli, for your help with the design!

Gor

P.S. There is always a backup plan to go to your "two function" idea with a
slight twist: almost two functions (coro.fork() and coro.suspend() the same
as in the original proposal).
===============================================================================
A CoroEarly pass which runs at EP_EarlyAsPossible will extract f.start,
using coro.fork as a guide (since the false branch bypasses the coroutine
body).

The coroutine will now travel as two functions until CoroSplit gets to
f.start. (It will probably use a 'coroutine' attribute on the function to
indicate that it requires CoroSplit to munge on it.)

I am still hoping we can find a solution that does not require separating
f.start out. It seems like a huge hammer to avoid undesirable code motion /
hoisting. (I really hope that the @coro.barrier(i64) approach can work.)
On Fri, Jun 10, 2016 at 10:32 PM, Gor Nishanov <gornishanov at gmail.com> wrote:

> Okay. New intrinsic:
>
>   declare void coro.barrier(i64); // will get a unique number for every insertion
>
> >> x = alloca...
> >>
> >> block1:
> >>   suspend
> >>   ; nothing can sneak in here???, right?
> >>   switch (resume, destroy, return)
> >>
> >> resume:
> >>   coro.barrier(0)
> >>   x += 1 ; load + inc + store
> >>   [...]
> >>
> >> destroy:
> >>   coro.barrier(1)
> >>   x += 1 ; load + inc + store
> >>   [...]
> >>
> >> return:
> >>   coro.barrier(2)
> >>   (don't touch x)
> >>   [...]
>
> Now we are good, right?
> I am still not giving up hope for an intrinsic approach.

coro.barrier() doesn't work: if the address of the alloca doesn't escape,
alias analysis will assume the barrier can't read or write the value of the
alloca, so the barrier doesn't actually block code movement.

> There is always a backup plan to go to your "two function" idea with a
> slight twist: almost two functions (coro.fork() and coro.suspend() the
> same as in the original proposal).
> ===============================================================================
>
> A CoroEarly pass which runs at EP_EarlyAsPossible will extract f.start,
> using coro.fork as a guide (since the false branch bypasses the coroutine
> body).
>
> The coroutine will now travel as two functions until CoroSplit gets to
> f.start. (It will probably use a 'coroutine' attribute on the function to
> indicate that it requires CoroSplit to munge on it.)
>
> I am still hoping we can find a solution that does not require separating
> f.start out. It seems like a huge hammer to avoid undesirable code motion /
> hoisting.
>
> (I really hope that the @coro.barrier(i64) approach can work.)

If you really don't want two functions, you could probably do something like
this:

block1:
  %first_time = load...
  br i1 %first_time, label return, label suspend1

suspend1:
  suspend
  switch (resume1, destroy1)

block2:
  %first_time = load...
  br i1 %first_time, label return, label suspend2

suspend2:
  suspend
  switch (resume2, destroy2)

return:
  %cont = phi i32 [(block1, 1), (block2, 2)]
  ; do stuff
  switch %cont... (1, label suspend1), (2, label suspend2)

It's a little awkward to generate the return block, but otherwise it's
reasonably clean. Also, if some non-C++ language wants to generate
coroutines, it might not have to generate the return block at all.

-Eli
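A minimal C++ analogy of why the barrier does not help - barrier() below is
an illustrative stand-in declared for the example, not a proposed API:
because the address of x is never passed to the call, the optimizer may
assume the call can neither read nor write x, so the increment is free to
move across it. The same reasoning applies to an intrinsic call that never
sees the alloca's address.

void barrier(long id);   // opaque external function, defined elsewhere

int counter() {
  int x = 0;     // address never escapes this function
  barrier(0);    // assumed not to touch x...
  x += 1;        // ...so this may be scheduled before or after the call,
  return x;      // and the whole body may fold down to "return 1"
}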
On Sat, Jun 11, 2016 at 5:09 PM, Gor Nishanov <gornishanov at gmail.com> wrote:

> Hi Eli:
>
> >> coro.barrier() doesn't work: if the address of the alloca doesn't
> >> escape, alias analysis will assume the barrier can't read or write the
> >> value of the alloca, so the barrier doesn't actually block code
> >> movement.
>
> Got it. I am new to this and learning a lot over the course of this
> thread. Thank you for being patient with me.
>
> Two questions and one clarification:
>
> Q1: Do we have to have a load here?
> ===================================
>
> >> block1:
> >>   %first_time = load...   <--- What are we loading here?

Just a local alloca, initialized to false and changed to true in the return
block.

> >>   br i1 %first_time, label return, label suspend1
> >>
> >> suspend1:
> >>   %0 = coro.suspend()
> >>   switch %0 (resume1, destroy1)
>
> Can we use a three-way coro.suspend instead?
>
> block1:
>   %0 = call i8 coro.suspend()
>   switch i8 %0, label suspend1 [i8 0 %return]     ; or icmp + br i1
>
> suspend1:
>   switch i8 %0, label %resume1 [i8 1 %destroy1]   ; or icmp + br i1

This doesn't look right: intuitively, the suspend happens after the return
block runs.

> One problem I can see is that someone could write a pass that merges the
> two branches / switches into one switch, and we are back where we were.
> I guess what you meant by a load is a call to some coro.is.first.time()
> intrinsic. So it looks like:
>
> block1:
>   %first_time = call i1 coro.is.first.time()
>   br i1 %first_time, label return, label suspend1
>
> suspend1:
>   %0 = coro.suspend()
>   switch %0 (resume1, destroy1)
>
> This looks fine; there may be more uses for this intrinsic in the
> frontend. Killing two birds with one stone. Good.

It doesn't really matter whether the bit gets tracked in an alloca or
through intrinsics.

> Question 2: Why the switch in the return block?
> ===============================================
>
> I would think that the **pre-split** return block would simply be:
>
> return:
>   <run dtors for parameters, if required>
>   <conversion ops for ret value, if required>
>   <ret void> or <ret whatever>
>
> Where and why should I put the switch that you mentioned in this return
> block?
>
> BTW, I am speaking of the return block as if it is one block, but it could
> be a dominating block over all the blocks that together run the
> destructors, do return value conversion, etc.

The best way to be sure the compiler will understand the control flow is if
the coroutine acts like a normal function. Another way to put it is that it
should be possible to lower a coroutine to a thread rather than performing
the state machine transformation.

The switch answers the question of where the control flow actually goes
after the return block runs. Under normal function semantics, the "return"
block doesn't actually return: it just performs the one-time operations,
then jumps back to the suspend call. Therefore, you can't use "ret" there;
you have to connect the control flow back to the correct suspend call. The
switch makes that connection.

So the return block looks like this:

  <run dtors for parameters, if required>
  <conversion ops for ret value, if required>
  call coro.first_time_ret_value(value) ; CoroSplit replaces this with a ret
  switch ... ; jump to suspend; this is always dead in the lowered version

The dead switch is there so the optimizer will understand the control flow.
And yes, this would be much more straightforward with a two-function
approach.
> Clarification:
> ==============
>
> >> Also, if some non-C++ language wants to generate coroutines,
> >> it might not have to generate the return block at all.
>
> C++ coroutines are flexible. The semantics of a coroutine are defined via
> traits, so you may define a coroutine that returns void. It does not have
> to return a coroutine handle or some struct that wraps the coroutine
> handle.

Oh, okay. I haven't actually looked at the spec; I'm mostly just going off
your description of what it does.

-Eli
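A minimal sketch of the traits/promise mechanism Gor refers to, written in
the form that was later standardized in C++20 (the 2016 TS placed these
types in std::experimental and some member names differed): the semantics
of a coroutine come entirely from the promise_type associated with its
return type, so the returned object can be completely empty rather than
wrapping a coroutine handle.

#include <coroutine>
#include <exception>

// The return object carries no coroutine handle and no state at all.
struct FireAndForget {
  struct promise_type {
    FireAndForget get_return_object() noexcept { return {}; }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() noexcept {}
    void unhandled_exception() { std::terminate(); }
  };
};

// A coroutine: runs eagerly, never suspends, and hands nothing back.
FireAndForget example() { co_return; }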
Hi Eli:

>> Block1:
>>   %0 = call i8 coro.suspend()
>>   switch i8 %0, label suspend1 [i8 0 %return]     ; or icmp + br i1
>> Suspend1:
>>   switch i8 %0, label %resume1 [i8 1 %destroy1]   ; or icmp + br i1
>>
>> This doesn't look right: intuitively the suspend happens after the return
>> block runs.

Perhaps, but that is not the intended semantics. The following two things
are distinct concepts:

* The coroutine is considered suspended.
* The coroutine returns control back to its caller (initial caller,
  resumer, or destroyer).

"The coroutine is considered suspended" means that it is legal to use the
coro.done, coro.resume, and coro.destroy intrinsics (either in this thread
or in a different one). It may seem strange, but consider this example:

  task<void> f() {
     char buf[N];
     ... fill buf with stuff
     await async_send(buf); // <suspend-point>
     ... do more
  }

The IR for the await expression will be (conceptually):

  %0 = coro.save()
  async_send(%buf)
  coro.suspend(%0)

coro.save is the point at which the coroutine is considered suspended. This
allows async_send to resume the coroutine even before async_send returns
control back to f. To make it safe, absolutely nothing should sneak in
between coro.save and coro.suspend beyond what the frontend/user put there.

Aside:
------
Based on what we discussed earlier, I suspect that the optimizer may try to
sneak stuff in between coro.save and coro.suspend. If that is the case, I
have a solution :-). Richard Smith suggested it to me earlier: use a
distinct function (just like you said with coro.fork :-)). There will be no
distinct coro.save. Instead, coro.suspend will take an optional function
pointer parameter, and the frontend will generate a little function for all
the code that needs to go in between coro.save and coro.suspend:

  f.await(%frame) {
     %buf = get from %frame
     async_send(%buf)
  }

  in f()
     coro.suspend([...], &f.await)

CoroEarly will do the extraction. The frontend will keep emitting
coro.save/coro.suspend pairs; after CoroEarly, the coro.save's will be
removed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
              P A U S E
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Okay, getting back to the switch in the return block.

>> The switch answers the question of where the control flow actually goes
>> after the return block runs.

Since control can reenter the coroutine at any point after coro.save (even
before the function has returned to its caller), I don't think there is any
point where putting the switch can help. We already told the optimizer that
we have three possible paths out of a suspend: return (first time), resume,
and destroy. I think that is as close to the truth as we can get.

>> Another way to put it is that it should be possible to lower a coroutine
>> to a thread rather than performing the state machine transformation.

That is a very good point! It is a great way to think about the semantics.
Unfortunately, I cannot avoid doing the state machine transformation in the
general case.

For coroutines with an "initial_suspend" point, i.e. ones that suspend
before entering the user-authored body, it is very easy:

  future<void> f(Params) { // <initial suspend point> at the open curly
     char buf[N];
     ... fill buf with stuff
     await async_send(buf); // <suspend-point>
     ... do more
  }

will be transformed into the equivalent of:

  future<void> f(Params) {
     promise<void> prom;
     future<void> result = prom.get_future();
     std::thread([prom = std::move(prom), Params = std::move(params)] {
        char buf[N];
        ... fill buf with stuff
        BLOCK_WAIT async_send(buf); // block on an event instead of suspending
        ... do more
     }).detach();
     return result;
  }

No state machine transformation whatsoever: just capture the parameters and
the promise and we are good to go.
[No need to involve LLVM either :-) - it is a simple lambda-like
transformation that can be done easily in the frontend.]

What if we don't have an initial suspend point? That means the coroutine
executes in the caller's thread until it hits the suspend point. Well, we
still need to build up a coroutine frame, and we need to chop up the
function at the coro.suspends. We need to put an index in the coro frame
and a switch in the entry blocks of resume and destroy (if there is more
than one suspend point). The only difference is what coro.suspend is
replaced with:

In start: coro.suspend launches a thread, giving f.resume as the start
routine and a frame pointer as the context. So f.resume will be executing
concurrently with f slowly finishing up and returning back to the caller.
(Btw, that is exactly the semantics we are trying to capture.)

In resume: coro.suspend is replaced with a blocking call waiting on
whatever will signal that the thread can resume.

=====================
Two function approach
=====================

Even though I am pushing back against it, I might prototype it and see how
it looks anyway. I will outline in CoroEarly and, during CoroSplit, inline
it back so that I can build the coroutine frame properly. But even as I am
typing it, it just feels like too much.

Look, what we are trying to solve is this:

  %0 = coro.suspend()            ; A thing should not be hoisted here
  %first = icmp %0, 0            ; or here
  br %first, %return, %suspend   ; <=== please don't move stuff above me

suspend:
  %1 = icmp %0, 1
  br %1, label %resume, label %destroy

resume:
  A Thing That Can Be Hoisted (same as in destroy)
  [...]

destroy:
  A Thing That Can Be Hoisted (same as in resume)
  [...]

return:
  Something Else

Or maybe:

  %0 = coro.suspend()            ; A thing should not be hoisted here
  %first = coro.is.first(%0)     ; or here
  br %first, %return, %suspend   ; <=== please don't move stuff above me

suspend:
  br %0, label %resume, label %destroy

resume:
  A Thing That Can Be Hoisted (same as in destroy)
  [...]

destroy:
  A Thing That Can Be Hoisted (same as in resume)
  [...]

return:
  Something Else

Would the optimizer really try to stick 'A thing' above br %first? As I
mentioned before, I am very new to this and probably missing a lot, but
(apart from outlining a portion of f), is there no simple way of preventing
instructions with side effects in the resume or destroy blocks from being
moved above the 'br %first' branch?

(And if there is no good answer, two functions it is. I will stop torturing
you with this :-))

Gor
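A rough, compilable C++ sketch of one way to realize the thread-based
lowering Gor describes for a coroutine without an initial suspend point.
ResumeEvent, do_before_suspend, and do_after_suspend are illustrative
stand-ins, and the event is a global here purely for brevity (in a real
lowering it would live in the coroutine frame): the body up to the first
suspend runs on the caller's thread, the first coro.suspend hands the rest
of the body to a new thread that blocks until resumed, and later suspends
likewise become blocking waits; resuming the coroutine is just signalling
the event.

#include <condition_variable>
#include <mutex>
#include <thread>

// Stand-in for "whatever will signal that the thread can resume".
struct ResumeEvent {
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;
  void signal() {
    { std::lock_guard<std::mutex> lock(m); ready = true; }
    cv.notify_one();
  }
  void wait() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [this] { return ready; });
    ready = false;
  }
};

ResumeEvent ev;                    // would live in the coroutine frame

void do_before_suspend() { /* user code up to the first suspend point */ }
void do_after_suspend()  { /* user code after the first suspend point */ }

void f() {
  do_before_suspend();             // runs on the caller's thread
  // First coro.suspend: instead of building a state machine, hand the rest
  // of the body to a new thread that blocks until the coroutine is resumed.
  std::thread([] {
    ev.wait();                     // later coro.suspend == block until resumed
    do_after_suspend();
  }).detach();
}                                  // control returns to the caller here

void resume_f() { ev.signal(); }   // coro.resume == signal the event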