On Fri, Jun 10, 2016 at 5:25 PM, Gor Nishanov <gornishanov at gmail.com> wrote:

> Hi Eli:
>
> >> Naively, you would expect that it would be legal to hoist the store...
> >> but that breaks your coroutine semantics because the global could be
> >> mutated between the first return and the resume.
>
> Hmmm... I don't see the problem. I think hoisting the store is a perfectly
> legal transformation with all semantics preserved.
>
> Let's look at your example:
>
> >> block1:
> >>   suspend
> >>   switch (resume, destroy, return)
> >>
> >> resume:
> >>   store zero to global @g
> >>   doA()
> >>   [...]
> >>
> >> destroy:
> >>   store zero to global @g
> >>   doB()
> >>   [...]
> >>
> >> return:
> >>   store zero to global @g
> >>   doC()
> >>   [...]
>
> As written, the behavior is:
>
> 1) When we encounter a suspend during the first pass through the function:
>    store 0 to @g and doC().
>
> 2) When we encounter a suspend after the coroutine was resumed:
>    ret void.
>
> 3) When the coroutine is resumed:
>    store 0 to @g and doA().
>
> 4) When the coroutine is destroyed:
>    store 0 to @g and doB().
>
> If this is a coroutine that can be resumed asynchronously from a different
> thread, there is a data race. For example, if resume happens before 'f'
> returns, doA() can write something useful to @g, and 'f' will clobber it
> with zero. But this race is already present in the code and is not
> introduced by LLVM.
>
> Let's hoist the store and see what happens:
>
> >> block1:
> >>   suspend
> >>   store zero to global @g
> >>   switch (resume, destroy, return)
> >>
> >> resume:
> >>   doA()
> >>   [...]
> >>
> >> destroy:
> >>   doB()
> >>   [...]
> >>
> >> return:
> >>   doC()
> >>   [...]
>
> Now, let's split it up:
> 1. RAUW coro.suspend with -1, 0, 1 in f, f.resume and f.destroy.
> 2. Remove coro.suspend in f; replace it with ret void in f.resume.
>
> void f() {
>   [...]
>   store zero to global @g
>   doC();
>   [...]
> }
>
> void @f.resume() {
> entry:
>   store zero to global @g
>   doA();
>   [....]
> }
>
> void @f.destroy() {
> entry:
>   store zero to global @g
>   doB();
>   [....]
> }
>
> Behavior looks exactly as before. What am I missing?

Err... I guess nothing; sorry, I got myself confused. Essentially executing
a return statement, then jumping back, seems wrong, but I'm having trouble
coming up with a good example of how it would actually break something. I
guess there aren't really any issues on a pure control-flow basis: you can't
execute a global side-effect a different number of times than it would
otherwise happen.

You might run into problems with allocas: LLVM's optimizer will assume the
lifetime of any alloca in the current function ends when you hit a "ret"
instruction, so jumping back from the ret to the middle of the function
could lead to wrong results. Silly example:

x = alloca...

block1:
  suspend
  ; increment could get moved here.
  switch (resume, destroy, return)

resume:
  x += 1
  [...]

destroy:
  x += 1
  [...]

return:
  (don't touch x)
  [...]

The increment is only supposed to execute once, but instead executes twice.

This isn't a great example... and I'm not sure issues are limited to
allocas... but that's the idea, anyway.

-Eli
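A rough C++ analogy of the alloca-lifetime hazard Eli describes - a minimal
sketch with illustrative names (Frame, f_broken, f_fixed), not code from
either proposal: a local whose lifetime ends when the function returns
cannot be used by a resume path that runs afterwards, which is one reason a
value that lives across a suspend point has to move into a heap-allocated
coroutine frame rather than stay a plain alloca.

#include <memory>
#include <thread>

struct Frame { int x = 0; };   // stand-in for a coroutine frame

void f_broken() {
  int x = 0;                                // plain local ("alloca"): dies at return
  std::thread([&x] { x += 1; }).detach();   // the "resume" may run after f_broken
}                                           // has returned: uses a dead stack slot (UB)

void f_fixed() {
  auto frame = std::make_shared<Frame>();            // outlives the return
  std::thread([frame] { frame->x += 1; }).detach();  // resume touches the frame instead
}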
Hi Eli:

>> Err... I guess nothing; sorry, I got myself confused.

That is my fault :-). I confused myself in the beginning, where I was trying
to treat the coro.suspend + cond.branch pair as a terminator and was looking
for branches. You showed me the way... Just ignore the branches and
represent the control flow cleanly, and SimplifyCFG will take care of the
rest :-).

>> You might run into problems with allocas:

>> x = alloca...
>>
>> block1:
>>   suspend
>>   ; increment could get moved here.
>>   switch (resume, destroy, return)
>>
>> resume:
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> destroy:
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> return:
>>   (don't touch x)
>>   [...]

Yep. You are right. We need some kind of barrier to prevent the motion.

Okay. New intrinsic:

  declare void coro.barrier(i64); // will get a unique number for every insertion

>> x = alloca...
>>
>> block1:
>>   suspend
>>   ; nothing can sneak in here???, right?
>>   switch (resume, destroy, return)
>>
>> resume:
>>   coro.barrier(0)
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> destroy:
>>   coro.barrier(1)
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> return:
>>   coro.barrier(2)
>>   (don't touch x)
>>   [...]

Now we are good, right?
I am still not giving up hope for an intrinsic approach.

Thanks a lot, Eli, for your help with the design!

Gor

P.S. There is always a backup plan to go to your "two function" idea with a
slight twist: almost two functions (coro.fork() and coro.suspend() the same
as in the original proposal).
===============================================================================
A CoroEarly pass which runs at EP_EarlyAsPossible will extract f.start,
using coro.fork as a guide (since the false branch bypasses the coroutine
body).

The coroutine will now travel as two functions until CoroSplit gets to
f.start. (It will probably use a 'coroutine' attribute on the function to
indicate that it requires CoroSplit to munge on it.)

I am still hoping we can find a solution that does not require separating
f.start out. It seems like a huge hammer to avoid undesirable code motion /
hoisting. (I really hope that the @coro.barrier(i64) approach can work.)
On Fri, Jun 10, 2016 at 10:32 PM, Gor Nishanov <gornishanov at gmail.com> wrote:

> Okay. New intrinsic:
>
>   declare void coro.barrier(i64); // will get a unique number for every insertion
>
> >> x = alloca...
> >>
> >> block1:
> >>   suspend
> >>   ; nothing can sneak in here???, right?
> >>   switch (resume, destroy, return)
> >>
> >> resume:
> >>   coro.barrier(0)
> >>   x += 1 ; load + inc + store
> >>   [...]
> >>
> >> destroy:
> >>   coro.barrier(1)
> >>   x += 1 ; load + inc + store
> >>   [...]
> >>
> >> return:
> >>   coro.barrier(2)
> >>   (don't touch x)
> >>   [...]
>
> Now we are good, right?
> I am still not giving up hope for an intrinsic approach.

coro.barrier() doesn't work: if the address of the alloca doesn't escape,
alias analysis will assume the barrier can't read or write the value of the
alloca, so the barrier doesn't actually block code movement.

> There is always a backup plan to go to your "two function" idea with a
> slight twist: almost two functions (coro.fork() and coro.suspend() the
> same as in the original proposal).
> ===============================================================================
>
> A CoroEarly pass which runs at EP_EarlyAsPossible will extract f.start,
> using coro.fork as a guide (since the false branch bypasses the coroutine
> body).
>
> The coroutine will now travel as two functions until CoroSplit gets to
> f.start. (It will probably use a 'coroutine' attribute on the function to
> indicate that it requires CoroSplit to munge on it.)
>
> I am still hoping we can find a solution that does not require separating
> f.start out. It seems like a huge hammer to avoid undesirable code motion /
> hoisting.
>
> (I really hope that the @coro.barrier(i64) approach can work.)

If you really don't want two functions, you could probably do something like
this:

block1:
  %first_time = load...
  br i1 %first_time, label return, label suspend1

suspend1:
  suspend
  switch (resume1, destroy1)

block2:
  %first_time = load...
  br i1 %first_time, label return, label suspend2

suspend2:
  suspend
  switch (resume2, destroy2)

return:
  %cont = phi i32 [(block1, 1), (block2, 2)]
  ; do stuff
  switch %cont... (1, label suspend1), (2, label suspend2)

It's a little awkward to generate the return block, but otherwise it's
reasonably clean. Also, if some non-C++ language wants to generate
coroutines, it might not have to generate the return block at all.

-Eli
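A minimal C++ analogy of why the barrier does not help - barrier() below is
an illustrative stand-in declared for the example, not a proposed API:
because the address of x is never passed to the call, the optimizer may
assume the call can neither read nor write x, so the increment is free to
move across it. The same reasoning applies to an intrinsic call that never
sees the alloca's address.

void barrier(long id);   // opaque external function, defined elsewhere

int counter() {
  int x = 0;     // address never escapes this function
  barrier(0);    // assumed not to touch x...
  x += 1;        // ...so this may be scheduled before or after the call,
  return x;      // and the whole body may fold down to "return 1"
}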
On Sat, Jun 11, 2016 at 5:09 PM, Gor Nishanov <gornishanov at gmail.com> wrote:

> Hi Eli:
>
> >> coro.barrier() doesn't work: if the address of the alloca doesn't
> >> escape, alias analysis will assume the barrier can't read or write the
> >> value of the alloca, so the barrier doesn't actually block code
> >> movement.
>
> Got it. I am new to this and learning a lot over the course of this
> thread. Thank you for being patient with me.
>
> Two questions and one clarification:
>
> Q1: Do we have to have a load here?
> ===================================
>
> >> block1:
> >>   %first_time = load...   <--- What are we loading here?

Just a local alloca, initialized to false and changed to true in the return
block.

> >>   br i1 %first_time, label return, label suspend1
> >>
> >> suspend1:
> >>   %0 = coro.suspend()
> >>   switch %0 (resume1, destroy1)
>
> Can we use a three-way coro.suspend instead?
>
> block1:
>   %0 = call i8 coro.suspend()
>   switch i8 %0, label suspend1 [i8 0 %return]     ; or icmp + br i1
>
> suspend1:
>   switch i8 %0, label %resume1 [i8 1 %destroy1]   ; or icmp + br i1

This doesn't look right: intuitively, the suspend happens after the return
block runs.

> One problem I can see is that someone could write a pass that merges the
> two branches / switches into one switch, and we are back where we were.
> I guess what you meant by a load is a call to some coro.is.first.time()
> intrinsic. So it looks like:
>
> block1:
>   %first_time = call i1 coro.is.first.time()
>   br i1 %first_time, label return, label suspend1
>
> suspend1:
>   %0 = coro.suspend()
>   switch %0 (resume1, destroy1)
>
> This looks fine; there may be more uses for this intrinsic in the
> frontend. Killing two birds with one stone. Good.

It doesn't really matter whether the bit gets tracked in an alloca or
through intrinsics.

> Question 2: Why the switch in the return block?
> ===============================================
>
> I would think that the **pre-split** return block would simply be:
>
> return:
>   <run dtors for parameters, if required>
>   <conversion ops for ret value, if required>
>   <ret void> or <ret whatever>
>
> Where and why should I put the switch that you mentioned in this return
> block?
>
> BTW, I am speaking of the return block as if it is one block, but it could
> be a dominating block over all the blocks that together run the
> destructors, do return value conversion, etc.

The best way to be sure the compiler will understand the control flow is if
the coroutine acts like a normal function. Another way to put it is that it
should be possible to lower a coroutine to a thread rather than performing
the state machine transformation.

The switch answers the question of where the control flow actually goes
after the return block runs. Under normal function semantics, the "return"
block doesn't actually return: it just performs the one-time operations,
then jumps back to the suspend call. Therefore, you can't use "ret" there;
you have to connect the control flow back to the correct suspend call. The
switch makes that connection.

So the return block looks like this:

  <run dtors for parameters, if required>
  <conversion ops for ret value, if required>
  call coro.first_time_ret_value(value) ; CoroSplit replaces this with a ret
  switch ... ; jump to suspend; this is always dead in the lowered version

The dead switch is there so the optimizer will understand the control flow.
And yes, this would be much more straightforward with a two-function
approach.
> Clarification:
> ==============
>
> >> Also, if some non-C++ language wants to generate coroutines,
> >> it might not have to generate the return block at all.
>
> C++ coroutines are flexible. The semantics of a coroutine are defined via
> traits, so you may define a coroutine that returns void. It does not have
> to return a coroutine handle or some struct that wraps the coroutine
> handle.

Oh, okay. I haven't actually looked at the spec; I'm mostly just going off
your description of what it does.

-Eli
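A minimal sketch of the traits/promise mechanism Gor refers to, written in
the form that was later standardized in C++20 (the 2016 TS placed these
types in std::experimental and some member names differed): the semantics
of a coroutine come entirely from the promise_type associated with its
return type, so the returned object can be completely empty rather than
wrapping a coroutine handle.

#include <coroutine>
#include <exception>

// The return object carries no coroutine handle and no state at all.
struct FireAndForget {
  struct promise_type {
    FireAndForget get_return_object() noexcept { return {}; }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() noexcept {}
    void unhandled_exception() { std::terminate(); }
  };
};

// A coroutine: runs eagerly, never suspends, and hands nothing back.
FireAndForget example() { co_return; }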
Hi Eli:

>> Block1:
>>   %0 = call i8 coro.suspend()
>>   switch i8 %0, label suspend1 [i8 0 %return]     ; or icmp + br i1
>> Suspend1:
>>   switch i8 %0, label %resume1 [i8 1 %destroy1]   ; or icmp + br i1
>>
>> This doesn't look right: intuitively the suspend happens after the return
>> block runs.

Perhaps, but that is not the intended semantics. The following two things
are distinct concepts:

* The coroutine is considered suspended.
* The coroutine returns control back to its caller (initial caller,
  resumer, or destroyer).

"The coroutine is considered suspended" means that it is legal to use the
coro.done, coro.resume, and coro.destroy intrinsics (either in this thread
or in a different one). It may seem strange, but consider this example:

  task<void> f() {
     char buf[N];
     ... fill buf with stuff
     await async_send(buf); // <suspend-point>
     ... do more
  }

The IR for the await expression will be (conceptually):

  %0 = coro.save()
  async_send(%buf)
  coro.suspend(%0)

coro.save is the point at which the coroutine is considered suspended. This
allows async_send to resume the coroutine even before async_send returns
control back to f. To make it safe, absolutely nothing should sneak in
between coro.save and coro.suspend beyond what the frontend/user put there.

Aside:
------
Based on what we discussed earlier, I suspect that the optimizer may try to
sneak stuff in between coro.save and coro.suspend. If that is the case, I
have a solution :-). Richard Smith suggested it to me earlier: use a
distinct function (just like you said with coro.fork :-)). There will be no
distinct coro.save. Instead, coro.suspend will take an optional function
pointer parameter, and the frontend will generate a little function for all
the code that needs to go in between coro.save and coro.suspend:

  f.await(%frame) {
     %buf = get from %frame
     async_send(%buf)
  }

  in f()
     coro.suspend([...], &f.await)

CoroEarly will do the extraction. The frontend will keep emitting
coro.save/coro.suspend pairs; after CoroEarly, the coro.save's will be
removed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
              P A U S E
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Okay, getting back to the switch in the return block.

>> The switch answers the question of where the control flow actually goes
>> after the return block runs.

Since control can reenter the coroutine at any point after coro.save (even
before the function has returned to its caller), I don't think there is any
point where putting the switch can help. We already told the optimizer that
we have three possible paths out of a suspend: return (first time), resume,
and destroy. I think that is as close to the truth as we can get.

>> Another way to put it is that it should be possible to lower a coroutine
>> to a thread rather than performing the state machine transformation.

That is a very good point! It is a great way to think about the semantics.
Unfortunately, I cannot avoid doing the state machine transformation in the
general case.

For coroutines with an "initial_suspend" point, i.e. ones that suspend
before entering the user-authored body, it is very easy:

  future<void> f(Params) { // <initial suspend point> at the open curly
     char buf[N];
     ... fill buf with stuff
     await async_send(buf); // <suspend-point>
     ... do more
  }

will be transformed into the equivalent of:

  future<void> f(Params) {
     promise<void> prom;
     future<void> result = prom.get_future();
     std::thread([prom = std::move(prom), Params = std::move(params)] {
        char buf[N];
        ... fill buf with stuff
        BLOCK_WAIT async_send(buf); // block on an event instead of suspending
        ... do more
     }).detach();
     return result;
  }

No state machine transformation whatsoever: just capture the parameters and
the promise and we are good to go.
[No need to involve LLVM either :-) - it is a simple lambda-like
transformation that can be done easily in the frontend.]

What if we don't have an initial suspend point? That means the coroutine
executes in the caller's thread until it hits the suspend point. Well, we
still need to build up a coroutine frame, and we need to chop up the
function at the coro.suspends. We need to put an index in the coro frame
and a switch in the entry blocks of resume and destroy (if there is more
than one suspend point). The only difference is what coro.suspend is
replaced with:

In start: coro.suspend launches a thread, giving f.resume as the start
routine and a frame pointer as the context. So f.resume will be executing
concurrently with f slowly finishing up and returning back to the caller.
(Btw, that is exactly the semantics we are trying to capture.)

In resume: coro.suspend is replaced with a blocking call waiting on
whatever will signal that the thread can resume.

=====================
Two function approach
=====================

Even though I am pushing back against it, I might prototype it and see how
it looks anyway. I will outline in CoroEarly and, during CoroSplit, inline
it back so that I can build the coroutine frame properly. But even as I am
typing it, it just feels like too much.

Look, what we are trying to solve is this:

  %0 = coro.suspend()            ; A thing should not be hoisted here
  %first = icmp %0, 0            ; or here
  br %first, %return, %suspend   ; <=== please don't move stuff above me

suspend:
  %1 = icmp %0, 1
  br %1, label %resume, label %destroy

resume:
  A Thing That Can Be Hoisted (same as in destroy)
  [...]

destroy:
  A Thing That Can Be Hoisted (same as in resume)
  [...]

return:
  Something Else

Or maybe:

  %0 = coro.suspend()            ; A thing should not be hoisted here
  %first = coro.is.first(%0)     ; or here
  br %first, %return, %suspend   ; <=== please don't move stuff above me

suspend:
  br %0, label %resume, label %destroy

resume:
  A Thing That Can Be Hoisted (same as in destroy)
  [...]

destroy:
  A Thing That Can Be Hoisted (same as in resume)
  [...]

return:
  Something Else

Would the optimizer really try to stick 'A thing' above br %first? As I
mentioned before, I am very new to this and probably missing a lot, but
(apart from outlining a portion of f), is there no simple way of preventing
instructions with side effects in the resume or destroy blocks from being
moved above the 'br %first' branch?

(And if there is no good answer, two functions it is. I will stop torturing
you with this :-))

Gor
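A rough, compilable C++ sketch of one way to realize the thread-based
lowering Gor describes for a coroutine without an initial suspend point.
ResumeEvent, do_before_suspend, and do_after_suspend are illustrative
stand-ins, and the event is a global here purely for brevity (in a real
lowering it would live in the coroutine frame): the body up to the first
suspend runs on the caller's thread, the first coro.suspend hands the rest
of the body to a new thread that blocks until resumed, and later suspends
likewise become blocking waits; resuming the coroutine is just signalling
the event.

#include <condition_variable>
#include <mutex>
#include <thread>

// Stand-in for "whatever will signal that the thread can resume".
struct ResumeEvent {
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;
  void signal() {
    { std::lock_guard<std::mutex> lock(m); ready = true; }
    cv.notify_one();
  }
  void wait() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [this] { return ready; });
    ready = false;
  }
};

ResumeEvent ev;                    // would live in the coroutine frame

void do_before_suspend() { /* user code up to the first suspend point */ }
void do_after_suspend()  { /* user code after the first suspend point */ }

void f() {
  do_before_suspend();             // runs on the caller's thread
  // First coro.suspend: instead of building a state machine, hand the rest
  // of the body to a new thread that blocks until the coroutine is resumed.
  std::thread([] {
    ev.wait();                     // later coro.suspend == block until resumed
    do_after_suspend();
  }).detach();
}                                  // control returns to the caller here

void resume_f() { ev.signal(); }   // coro.resume == signal the event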