thr3ads.net - llvm dev - [llvm-dev] [RFC] LLVM Coroutines [Jun 2016]

If this information is useful, please help other people find it:
Share via:

Gor Nishanov via llvm-dev

2016-Jun-13 01:40 UTC

[llvm-dev] [RFC] LLVM Coroutines

Hi Sanjoy:
>> Now in the above CFG %val can be folded to 10, but in reality you need
>> a PHI of 10 and 20
Quick answer: folding %val to 10 is OK, but, I would prefer it to be
'undef'
or even catch it in the verifier as it is a programmer/frontend error.

Details are in the second half of my reply. I would like to start
with the thought experiment you suggested, as it might help to explain
the behavior a little better.
>> The thought experiment relevant here is that **could** you implement
>> your semantics with reasonable bodies for the llvm.coro intrinsics?
I can emulate **control flow** very closely with user mode thread switching
APIs like make_context, jump_context. Memory behavior not so much.

Library emulation
================
Let's build a fiber abstraction out of make_context, jump_context.

i1 fiber_fork(i8* %mem) - clones this function into a fiber
                          returns 0 - in a fiber
                          returns 1 - in a thread
                          uses %mem to store fiber context + fiber stack

i1 fiber_suspend()      - suspends current fiber, switches back to thread
                          stores the fiber context to be able to resume later

void fiber_join(i8* %mem i1 %val) - switches to fiber stack identified by %mem,
                          fiber_suspend will return %val when resumed

void fiber_end() noreturn
                        - switches back to a thread, does not remember fiber
                          context and does not touch %mem.
                          (Thus you can use this after %mem is freed)


Detail: fiber_fork(i8* %mem)
----------------------------
  * Prepares fcontext in %mem utilizing the rest of the memory for stack.
  * Saves current thread context on a current thread stack.
    (We should not store if in %mem, otherwise we won't be
     able to switch back after fiber memory is freed)
  * Sets up **current thread** context, so that when resumed it will proceed
    as if fiber_fork returned true
  * Sets up **fiber** context so that when resumed, the fiber will proceed
    as if fiber_fork returned false.
  * Clones the locals on the thread stack to a fiber stack
    (This is very approximate behavior, and won't work if we took address of
     alloca and stashed it somewhere, but let's ignore it for now as we
     are focusing on control flow)
  * Switches to fiber

Okay, let's map it to our intrinsics:

coro.fork == fiber_fork
coro.suspend == fiber_suspend
coro.end(true) == fiber_end
coro.resume = fiber_join(%mem, i1 false)
coro.destroy = fiber_join(%mem, i1 true)
coro.size == 1,000,000 (or some other big number)

Now let's walk through this example in the fiber model:

   T coro() {
       %t = alloca
       %mem = malloc(...);
       (*%t) = 10
       if (coro.fork()) { // switches to fiber at label Start
   ReturnBB:
         %val = *%t // will be 10, as fiber has its own copy
         return some_code_creating_T(%val);
       }
   Start:
       Body Of The Coroutine
       With Suspends and All The Fun
   DeleteBB:
       (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
       free(%mem)
       (*%t) = 20 ; // undef: as the fiber memory is freed already
       coro.end(true) // switches back to the thread at ReturnBB
       UNREACHABLE // <== never reaches here. So we are honest with optimizer
     }


Back to the LLVM based implementation
====================================
In your example, there will be a path from the block where %t is used to
where %t is defined that crosses a suspend point. Therefore %t will be part of
the coroutine frame. So before the split, the program will be transformed into:

   coro.frame = type { int %t }

   T coro() {
       %t = alloca
       %mem = malloc(...);
       %s = (coro.frame*)coro.init(%mem)
       s->%t = 10
       if (coro.fork()) {
   ReturnBB:
         %val = 10 // const.prop as 10, OK, but really undef
         return some_code_creating_T(%val);
       }
       Body Of The Coroutine
       With Suspends and All The Fun
   DeleteBB:
       free(%mem)
       s->%t = 20 //// BOOM writing to freed memory
       coro.end(true)
       UNREACHABLE
     }

Accessing anything in the coroutine frame after coro.fork "true" is
undefined.
As by the time you get to returnBB, the coroutine frame could already
be destroyed.

Both problems can be statically detected in the Verifier.

Alternatively, in CoroEarly pass that runs at EP_EarlyAsPossible extension
point, we can replace:

         %val = *%t
             with
         %val = undef (used after coro.fork() returned true)

              and

       free(%mem)
       (*%t) = 20
            with
       @llvm.trap() ; used after frame is destroyed


--Gor

Gor Nishanov via llvm-dev

2016-Jun-13 03:29 UTC

head link

[llvm-dev] [RFC] LLVM Coroutines

Small clarification. In coro.* emulation via fibers, the comment on
coro.end(true)

       coro.end(true) // switches back to the thread at ReturnBB <-- this one
       UNREACHABLE // <== never reaches here. So we are honest with optimizer
     }

should read something like:

    coro.end(true) // switches back to the thread that resumed this fiber.

Which will be ReturnBB if we did not hit any suspends on the way to coro.end.
If we did hit a suspend, it is the suspend that will switch back to ReturnBB.

With llvm implementation, it is modeled by replacing coro.suspend and
coro.end(true) with
br ReturnBB in the initial function which represents the first
execution of the coroutine before it reached any suspends.

In resume function, coro.suspend and coro.end are replaced with `ret
void`, thus modelling returning to whomever resumed us.

To me it looks that it does match fiber behavior exactly.

-- Gor

On Sun, Jun 12, 2016 at 6:40 PM, Gor Nishanov <gornishanov at gmail.com>
wrote:> Hi Sanjoy:
>
>>> Now in the above CFG %val can be folded to 10, but in reality you
need
>>> a PHI of 10 and 20
>
> Quick answer: folding %val to 10 is OK, but, I would prefer it to be
'undef'
> or even catch it in the verifier as it is a programmer/frontend error.
>
> Details are in the second half of my reply. I would like to start
> with the thought experiment you suggested, as it might help to explain
> the behavior a little better.
>
>>> The thought experiment relevant here is that **could** you
implement
>>> your semantics with reasonable bodies for the llvm.coro intrinsics?
>
> I can emulate **control flow** very closely with user mode thread switching
> APIs like make_context, jump_context. Memory behavior not so much.
>
> Library emulation
> ================>
> Let's build a fiber abstraction out of make_context, jump_context.
>
> i1 fiber_fork(i8* %mem) - clones this function into a fiber
>                           returns 0 - in a fiber
>                           returns 1 - in a thread
>                           uses %mem to store fiber context + fiber stack
>
> i1 fiber_suspend()      - suspends current fiber, switches back to thread
>                           stores the fiber context to be able to resume
later
>
> void fiber_join(i8* %mem i1 %val) - switches to fiber stack identified by
%mem,
>                           fiber_suspend will return %val when resumed
>
> void fiber_end() noreturn
>                         - switches back to a thread, does not remember
fiber
>                           context and does not touch %mem.
>                           (Thus you can use this after %mem is freed)
>
>
> Detail: fiber_fork(i8* %mem)
> ----------------------------
>   * Prepares fcontext in %mem utilizing the rest of the memory for stack.
>   * Saves current thread context on a current thread stack.
>     (We should not store if in %mem, otherwise we won't be
>      able to switch back after fiber memory is freed)
>   * Sets up **current thread** context, so that when resumed it will
proceed
>     as if fiber_fork returned true
>   * Sets up **fiber** context so that when resumed, the fiber will proceed
>     as if fiber_fork returned false.
>   * Clones the locals on the thread stack to a fiber stack
>     (This is very approximate behavior, and won't work if we took
address of
>      alloca and stashed it somewhere, but let's ignore it for now as we
>      are focusing on control flow)
>   * Switches to fiber
>
> Okay, let's map it to our intrinsics:
>
> coro.fork == fiber_fork
> coro.suspend == fiber_suspend
> coro.end(true) == fiber_end
> coro.resume = fiber_join(%mem, i1 false)
> coro.destroy = fiber_join(%mem, i1 true)
> coro.size == 1,000,000 (or some other big number)
>
> Now let's walk through this example in the fiber model:
>
>    T coro() {
>        %t = alloca
>        %mem = malloc(...);
>        (*%t) = 10
>        if (coro.fork()) { // switches to fiber at label Start
>    ReturnBB:
>          %val = *%t // will be 10, as fiber has its own copy
>          return some_code_creating_T(%val);
>        }
>    Start:
>        Body Of The Coroutine
>        With Suspends and All The Fun
>    DeleteBB:
>        (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
>        free(%mem)
>        (*%t) = 20 ; // undef: as the fiber memory is freed already
>        coro.end(true) // switches back to the thread at ReturnBB
>        UNREACHABLE // <== never reaches here. So we are honest with
optimizer
>      }
>
>
> Back to the LLVM based implementation
> ====================================>
> In your example, there will be a path from the block where %t is used to
> where %t is defined that crosses a suspend point. Therefore %t will be part
of
> the coroutine frame. So before the split, the program will be transformed
into:
>
>    coro.frame = type { int %t }
>
>    T coro() {
>        %t = alloca
>        %mem = malloc(...);
>        %s = (coro.frame*)coro.init(%mem)
>        s->%t = 10
>        if (coro.fork()) {
>    ReturnBB:
>          %val = 10 // const.prop as 10, OK, but really undef
>          return some_code_creating_T(%val);
>        }
>        Body Of The Coroutine
>        With Suspends and All The Fun
>    DeleteBB:
>        free(%mem)
>        s->%t = 20 //// BOOM writing to freed memory
>        coro.end(true)
>        UNREACHABLE
>      }
>
> Accessing anything in the coroutine frame after coro.fork "true"
is undefined.
> As by the time you get to returnBB, the coroutine frame could already
> be destroyed.
>
> Both problems can be statically detected in the Verifier.
>
> Alternatively, in CoroEarly pass that runs at EP_EarlyAsPossible extension
> point, we can replace:
>
>          %val = *%t
>              with
>          %val = undef (used after coro.fork() returned true)
>
>               and
>
>        free(%mem)
>        (*%t) = 20
>             with
>        @llvm.trap() ; used after frame is destroyed
>
>
> --Gor

Sanjoy Das via llvm-dev

2016-Jun-14 20:25 UTC

head link

[llvm-dev] [RFC] LLVM Coroutines

Hi Gor,

On Sun, Jun 12, 2016 at 6:40 PM, Gor Nishanov <gornishanov at gmail.com>
wrote:> Quick answer: folding %val to 10 is OK, but, I would prefer it to be
'undef'
> or even catch it in the verifier as it is a programmer/frontend error.
Ok.
>>> The thought experiment relevant here is that **could** you
implement
>>> your semantics with reasonable bodies for the llvm.coro intrinsics?
>
> I can emulate **control flow** very closely with user mode thread switching
> APIs like make_context, jump_context. Memory behavior not so much.
>
> Library emulation
> ================>
> Let's build a fiber abstraction out of make_context, jump_context.
>
> i1 fiber_fork(i8* %mem) - clones this function into a fiber
>                           returns 0 - in a fiber
>                           returns 1 - in a thread
>                           uses %mem to store fiber context + fiber stack
I'm not familiar with fiber-type APIs, but I assume fiber_fork is like
setjmp, in that it can "return twice"?  If so, I'd urge you to try
to
not rely on that kind of behavior.  LLVM has unresolved issues around
modeling setjmp like behavior, and if possible, it is preferable to
make all of the control flow explicit from the very beginning.
> i1 fiber_suspend()      - suspends current fiber, switches back to thread
>                           stores the fiber context to be able to resume
later
This looks like a normal function call, except that the "return" may
be arbitrarily spaced out from the call?
> void fiber_join(i8* %mem i1 %val) - switches to fiber stack identified by
%mem,
>                           fiber_suspend will return %val when resumed
>
> void fiber_end() noreturn
>                         - switches back to a thread, does not remember
fiber
>                           context and does not touch %mem.
>                           (Thus you can use this after %mem is freed)
>
>
> Detail: fiber_fork(i8* %mem)
> ----------------------------
>   * Prepares fcontext in %mem utilizing the rest of the memory for stack.
>   * Saves current thread context on a current thread stack.
>     (We should not store if in %mem, otherwise we won't be
>      able to switch back after fiber memory is freed)
As I said above, these following two points can be problematic:
>   * Sets up **current thread** context, so that when resumed it will
proceed
>     as if fiber_fork returned true
>   * Sets up **fiber** context so that when resumed, the fiber will proceed
>     as if fiber_fork returned false.
>
> Okay, let's map it to our intrinsics:
>
> coro.fork == fiber_fork
> coro.suspend == fiber_suspend
> coro.end(true) == fiber_end
> coro.resume = fiber_join(%mem, i1 false)
> coro.destroy = fiber_join(%mem, i1 true)
> coro.size == 1,000,000 (or some other big number)
>
> Now let's walk through this example in the fiber model:
>
>    T coro() {
>        %t = alloca
>        %mem = malloc(...);
>        (*%t) = 10
>        if (coro.fork()) { // switches to fiber at label Start
>    ReturnBB:
>          %val = *%t // will be 10, as fiber has its own copy
>          return some_code_creating_T(%val);
>        }
>    Start:
>        Body Of The Coroutine
>        With Suspends and All The Fun
>    DeleteBB:
>        (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
>        free(%mem)
>        (*%t) = 20 ; // undef: as the fiber memory is freed already
Say there was no (*%t) = 20. There is nothing so far that prevents
sinking "(*%t) = 15" from before the free (where it is legal) to after
the free (where it is not).  %t is an unescaped alloca, and LLVM can
move around loads from and stores to these (as a first approximation)
as it pleases.
>        coro.end(true) // switches back to the thread at ReturnBB
>        UNREACHABLE // <== never reaches here. So we are honest with
optimizer
>      }
>
>
> Back to the LLVM based implementation
> ====================================>
> In your example, there will be a path from the block where %t is used to
> where %t is defined that crosses a suspend point. Therefore %t will be part
of
> the coroutine frame. So before the split, the program will be transformed
into:
>
>    coro.frame = type { int %t }
>
>    T coro() {
>        %t = alloca
>        %mem = malloc(...);
>        %s = (coro.frame*)coro.init(%mem)
>        s->%t = 10
>        if (coro.fork()) {
>    ReturnBB:
>          %val = 10 // const.prop as 10, OK, but really undef
>          return some_code_creating_T(%val);
>        }
>        Body Of The Coroutine
>        With Suspends and All The Fun
>    DeleteBB:
>        free(%mem)
>        s->%t = 20 //// BOOM writing to freed memory
>        coro.end(true)
>        UNREACHABLE
>      }
>
> Accessing anything in the coroutine frame after coro.fork "true"
is undefined.
> As by the time you get to returnBB, the coroutine frame could already
> be destroyed.
So when you say "some_code_creating_T()" are there restrictions on
what kind of code there could be as part of the creation logic (I've
so far assumed that it can be arbitrary code).  If it can't
(semantically) touch anything in the stack frame then why have it as
part of @coro() ?

-- Sanjoy

Gor Nishanov via llvm-dev

2016-Jun-15 01:22 UTC

head link

[llvm-dev] [RFC] LLVM Coroutines

Hi Sanjoy,
>> I'm not familiar with fiber-type APIs, but I assume fiber_fork is
like
>> setjmp, in that it can "return twice"?
Yes, user-mode stack switching API are somewhat similar to setjmp. Here are
links to a doc page and implementation, just in case you are curious:

http://www.boost.org/doc/libs/1_59_0/libs/context/doc/html/context/context.html
http://svn.boost.org/svn/boost/trunk/libs/context/src/asm/jump_x86_64_ms_pe_masm.asm
http://svn.boost.org/svn/boost/trunk/libs/context/src/asm/jump_x86_64_sysv_elf_gas.S
>> If so, I'd urge you to try to not rely on that kind of behavior.
LLVM has
>> unresolved issues around modeling setjmp like behavior
Absolutely! This was just a thought experiment how one could implement the
intrinsics in a library. One of the benefits of compiler based coroutines is
that they allow to avoid all of that mess. Suspend is just a 'ret void',
resume is simply a direct (or indirect) call and a coroutine state is tiny
(assuming you don't use 64K arrays as local variables :-) )
>>    DeleteBB:
>>        (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
>>        free(%mem)
>>        (*%t) = 20 ; // undef: as the fiber memory is freed already
>>        coro.end(true)
>>        UNREACHABLE
>>
>> Say there was no (*%t) = 20. There is nothing so far that prevents
>> sinking "(*%t) = 15" from before the free (where it is legal)
to after
>> the free (where it is not).  %t is an unescaped alloca, and LLVM can
>> move around loads from and stores to these (as a first approximation)
>> as it pleases.
Good point. During CoroSplit I should sink CoroutineFrame deallocation
all the way down to coro.end().
>> So when you say "some_code_creating_T()" are there
restrictions on
>> what kind of code there could be as part of the creation logic
(I've
>> so far assumed that it can be arbitrary code).
You are right. It *is* arbitrary code. In my reply to you earlier I had a
particular use case in mind, where you should not touch anything in the
coroutine frame in the return block, but that is higher level semantics that
frontend or library imbues coroutine with, not something that LLVM enforces.

The case I had in mind was that a frontend generated an asynchronous coroutine
without RAII semantics. Once launched it will run driven by asynchronous
completions and when it runs to the end it will destroy itself.

Such a coroutine can get suspended, get resumed by a different thread, run to
completion and free the coroutine frame **before** the thread that created
the coroutine manages to get to its return block. In such a coroutine reading
or writing to coroutine frame in return block is undefined behavior.

On the other hand, if it is a synchronous generator that always starts
suspended, then any writes or reads from a coroutine frame in the
return block are perfectly fine.

Since LLVM does not know what kind of coroutine user is trying to build, it
should take a conservative approach. LLVM should not introduce writes or reads
to a coroutine frame in the return block if they weren't there, but, if
frontend put them there, LLVM should not replace them with
`@llvm.traps` and `undef`s either.

Gor

On Tue, Jun 14, 2016 at 1:25 PM, Sanjoy Das
<sanjoy at playingwithpointers.com> wrote:> Hi Gor,
>
> On Sun, Jun 12, 2016 at 6:40 PM, Gor Nishanov <gornishanov at
gmail.com> wrote:
>> Quick answer: folding %val to 10 is OK, but, I would prefer it to be
'undef'
>> or even catch it in the verifier as it is a programmer/frontend error.
>
> Ok.
>
>>>> The thought experiment relevant here is that **could** you
implement
>>>> your semantics with reasonable bodies for the llvm.coro
intrinsics?
>>
>> I can emulate **control flow** very closely with user mode thread
switching
>> APIs like make_context, jump_context. Memory behavior not so much.
>>
>> Library emulation
>> ================>>
>> Let's build a fiber abstraction out of make_context, jump_context.
>>
>> i1 fiber_fork(i8* %mem) - clones this function into a fiber
>>                           returns 0 - in a fiber
>>                           returns 1 - in a thread
>>                           uses %mem to store fiber context + fiber
stack
>
> I'm not familiar with fiber-type APIs, but I assume fiber_fork is like
> setjmp, in that it can "return twice"?  If so, I'd urge you
to try to
> not rely on that kind of behavior.  LLVM has unresolved issues around
> modeling setjmp like behavior, and if possible, it is preferable to
> make all of the control flow explicit from the very beginning.
>
>> i1 fiber_suspend()      - suspends current fiber, switches back to
thread
>>                           stores the fiber context to be able to resume
later
>
> This looks like a normal function call, except that the "return"
may
> be arbitrarily spaced out from the call?
>
>> void fiber_join(i8* %mem i1 %val) - switches to fiber stack identified
by %mem,
>>                           fiber_suspend will return %val when resumed
>>
>> void fiber_end() noreturn
>>                         - switches back to a thread, does not remember
fiber
>>                           context and does not touch %mem.
>>                           (Thus you can use this after %mem is freed)
>>
>>
>> Detail: fiber_fork(i8* %mem)
>> ----------------------------
>>   * Prepares fcontext in %mem utilizing the rest of the memory for
stack.
>>   * Saves current thread context on a current thread stack.
>>     (We should not store if in %mem, otherwise we won't be
>>      able to switch back after fiber memory is freed)
>
> As I said above, these following two points can be problematic:
>
>>   * Sets up **current thread** context, so that when resumed it will
proceed
>>     as if fiber_fork returned true
>>   * Sets up **fiber** context so that when resumed, the fiber will
proceed
>>     as if fiber_fork returned false.
>
>>
>> Okay, let's map it to our intrinsics:
>>
>> coro.fork == fiber_fork
>> coro.suspend == fiber_suspend
>> coro.end(true) == fiber_end
>> coro.resume = fiber_join(%mem, i1 false)
>> coro.destroy = fiber_join(%mem, i1 true)
>> coro.size == 1,000,000 (or some other big number)
>>
>> Now let's walk through this example in the fiber model:
>>
>>    T coro() {
>>        %t = alloca
>>        %mem = malloc(...);
>>        (*%t) = 10
>>        if (coro.fork()) { // switches to fiber at label Start
>>    ReturnBB:
>>          %val = *%t // will be 10, as fiber has its own copy
>>          return some_code_creating_T(%val);
>>        }
>>    Start:
>>        Body Of The Coroutine
>>        With Suspends and All The Fun
>>    DeleteBB:
>>        (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
>>        free(%mem)
>>        (*%t) = 20 ; // undef: as the fiber memory is freed already
>
> Say there was no (*%t) = 20. There is nothing so far that prevents
> sinking "(*%t) = 15" from before the free (where it is legal) to
after
> the free (where it is not).  %t is an unescaped alloca, and LLVM can
> move around loads from and stores to these (as a first approximation)
> as it pleases.
>
>>        coro.end(true) // switches back to the thread at ReturnBB
>>        UNREACHABLE // <== never reaches here. So we are honest with
optimizer
>>      }
>>
>>
>> Back to the LLVM based implementation
>> ====================================>>
>> In your example, there will be a path from the block where %t is used
to
>> where %t is defined that crosses a suspend point. Therefore %t will be
part of
>> the coroutine frame. So before the split, the program will be
transformed into:
>>
>>    coro.frame = type { int %t }
>>
>>    T coro() {
>>        %t = alloca
>>        %mem = malloc(...);
>>        %s = (coro.frame*)coro.init(%mem)
>>        s->%t = 10
>>        if (coro.fork()) {
>>    ReturnBB:
>>          %val = 10 // const.prop as 10, OK, but really undef
>>          return some_code_creating_T(%val);
>>        }
>>        Body Of The Coroutine
>>        With Suspends and All The Fun
>>    DeleteBB:
>>        free(%mem)
>>        s->%t = 20 //// BOOM writing to freed memory
>>        coro.end(true)
>>        UNREACHABLE
>>      }
>>
>> Accessing anything in the coroutine frame after coro.fork
"true" is undefined.
>> As by the time you get to returnBB, the coroutine frame could already
>> be destroyed.
>
> So when you say "some_code_creating_T()" are there restrictions
on
> what kind of code there could be as part of the creation logic (I've
> so far assumed that it can be arbitrary code).  If it can't
> (semantically) touch anything in the stack frame then why have it as
> part of @coro() ?
>
> -- Sanjoy

Maybe Matching Threads

Search for more reasonably related threads

llvm dev - Jun 2016 - [RFC] LLVM Coroutines

[llvm-dev] [RFC] LLVM Coroutines

[llvm-dev] [RFC] LLVM Coroutines

[llvm-dev] [RFC] LLVM Coroutines

[llvm-dev] [RFC] LLVM Coroutines

Maybe Matching Threads