The concern is that a client is blocked while each request is processed. The
nbdkit server design requires a thread per request being processed,
regardless of the number of connections or clients. We want to run 1000s of
requests in parallel without needing a thread at the nbdkit layer per
request in flight. Our plugin layer is built around Boost.Asio, and a few
threads in a worker pool running an io_service can process 1000s of requests
in parallel. (Our plugin is a gateway of sorts and requests are sent back
out over the network. While our plugin waits for the read or write data we
don't block in a thread; we handle other requests that are ready.)
The current nbdkit server design requires a thread per request in progress
because it is built around a synchronous callback into the plugin layer, and
the main recv_request_send_reply loop holds the only copy of the request
handle that is needed to send the reply.
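For reference, the current synchronous path looks roughly like this (a
simplified sketch, not the actual connections.c code; recv_request,
handle_request and send_reply_to_client stand in for the real helpers):

static int
recv_request_send_reply (struct connection *conn)
{
  struct request request;
  uint32_t error = 0;
  char *buf = NULL;

  /* 1. Read and validate the request header (and any write payload). */
  if (recv_request (conn, &request, &buf) == -1)
    return -1;

  /* 2. Call the plugin synchronously; this thread is blocked until the
   *    pread/pwrite/flush/trim/zero callback has completely finished. */
  if (handle_request (conn, &request, buf) == -1)
    error = errno;

  /* 3. Only now can the reply go out, because request.handle lives in
   *    this stack frame and nowhere else. */
  return send_reply_to_client (conn, request.handle, buf, error);
}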
A more flexible design would split the recv_request_send_reply loop into a
recv_request loop and a send_reply function. The recv_request loop forwards
the request handle to the handle_request call. The existing nbdkit_plugin
struct would function identically: send_reply is called after plugin.pread
(plus any FUA flush) has finished.
An alternative plugin struct "nbdkit_plugin_lowlevel" could define a
different interface where an opaque ptr to the handle + connection + flags
is passed in, and the plugin is required to call nbdkit_reply (opaque *ptr,
...) to send the reply to the NBD client rather than nbdkit automatically
sending the reply after the plugin function returns.
Some pseudo-code examples of the changes I had in mind:
struct operation {
  struct connection *conn;
  uint64_t handle;
  uint32_t cmd;
  uint32_t flags;
  char *buf;
  uint32_t count;
};
// unchanged
struct nbdkit_plugin {
  ...
  int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset);
  int (*pwrite) (void *handle, const void *buf, uint32_t count,
                 uint64_t offset);
  int (*flush) (void *handle);
  int (*trim) (void *handle, uint32_t count, uint64_t offset);
  int (*zero) (void *handle, uint32_t count, uint64_t offset, int may_trim);
};
// new lowlevel api
struct nbdkit_plugin_lowlevel {
  ...
  int (*pread) (void *op, void *handle, void *buf, uint32_t count,
                uint64_t offset);
  int (*pwrite) (void *op, void *handle, void *buf, uint32_t count,
                 uint64_t offset);
  int (*flush) (void *op, void *handle);
  int (*trim) (void *op, void *handle, uint32_t count, uint64_t offset);
  int (*zero) (void *op, void *handle, uint32_t count, uint64_t offset,
               int may_trim);
};
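A plugin would fill in this struct much like the existing nbdkit_plugin one;
a minimal sketch of what registration might look like (the field list and
the NBDKIT_REGISTER_PLUGIN_LOWLEVEL macro are placeholders for whatever the
new header would actually define):

static struct nbdkit_plugin_lowlevel plugin = {
  .name     = "my_gateway",              /* hypothetical example plugin */
  .open     = my_plugin_open,
  .close    = my_plugin_close,
  .get_size = my_plugin_get_size,
  .pread    = my_plugin_pread_lowlevel,
  .pwrite   = my_plugin_pwrite_lowlevel,
  .flush    = my_plugin_flush_lowlevel,
  .trim     = my_plugin_trim_lowlevel,
  .zero     = my_plugin_zero_lowlevel,
};

NBDKIT_REGISTER_PLUGIN_LOWLEVEL (plugin)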
// Called by the lowlevel api to send a reply to the client
void
nbdkit_reply (void *vop)
{
  int r;
  bool flush_after_command;
  struct operation *op = (struct operation *) vop;

  flush_after_command = (op->flags & NBD_CMD_FLAG_FUA) != 0;
  if (!op->conn->can_flush || op->conn->readonly)
    flush_after_command = false;

  if (flush_after_command) {
    op->flags = 0;   /* clear FUA so the flush completion sends the reply */
    r = plugin_flush_lowlevel (op, op->conn->handle);
    if (r == -1) {
      /* do error stuff, e.g. reply with the flush error */
    }
  }
  else if (op->cmd == NBD_CMD_READ)
    send_reply (op, op->count, op->buf, 0);
  else
    send_reply (op, 0, NULL, 0);
}
int
my_plugin_pwrite_lowlevel (void *op, void *handle, void *buf,
                           uint32_t count, uint64_t offset)
{
  if (is_readonly (handle)) {
    nbdkit_reply_error (op, EROFS);
    return 0;
  }
  if (some critically bad issue)
    return -1;

  // this returns right away, before the write has completed, and when it
  // does complete the handler lambda is called
  my_storage.async_write (buf, count, offset,
    [op, count](const boost::system::error_code &ec) {
      if (ec)
        nbdkit_reply_error (op, ec.value());
      else
        nbdkit_reply (op);
    });
  return 0;
}
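The read path is symmetric, and nothing about it requires C++; a sketch in
plain C against a made-up callback-style async API (my_storage_async_read
and my_read_done are hypothetical) would be:

// Completion callback run by the (hypothetical) async storage layer once
// the data for the read has been copied into the buffer nbdkit handed us.
static void
my_read_done (void *vop, int error)
{
  if (error)
    nbdkit_reply_error (vop, error);
  else
    nbdkit_reply (vop);  /* op->cmd == NBD_CMD_READ, so this sends op->buf */
}

int
my_plugin_pread_lowlevel (void *op, void *handle, void *buf,
                          uint32_t count, uint64_t offset)
{
  // Returns immediately; my_read_done fires later on a worker thread.
  my_storage_async_read (handle, buf, count, offset, my_read_done, op);
  return 0;
}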
// connections.c
static int
_send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)
{
  int r;
  struct reply reply;

  reply.magic = htobe32 (NBD_REPLY_MAGIC);
  reply.handle = op->handle;
  reply.error = htobe32 (nbd_errno (error));

  if (error != 0) {
    /* Since we're about to send only the limited NBD_E* errno to the
     * client, don't lose the information about what really happened
     * on the server side.  Make sure there is a way for the operator
     * to retrieve the real error.
     */
    debug ("sending error reply: %s", strerror (error));
  }

  r = xwrite (op->conn->sockout, &reply, sizeof reply);
  if (r == -1) {
    nbdkit_error ("write reply: %m");
    return -1;
  }

  if (op->cmd == NBD_CMD_READ) { /* Send the read data buffer. */
    r = xwrite (op->conn->sockout, buf, count);
    if (r == -1) {
      nbdkit_error ("write data: %m");
      return -1;
    }
  }

  return 0;
}
// new mutex on writes due to the parallel nature of responding to the socket
int
send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)
{
  int r;

  plugin_lock_reply (op->conn);
  r = _send_reply (op, count, buf, error);
  plugin_unlock_reply (op->conn);
  free (op);
  return r;
}
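One piece the pseudo-code above doesn't show is the receive side. Roughly (a
sketch only; recv_request and dispatch_to_plugin_lowlevel are illustrative
names, not a final API), the recv_request loop would validate each request,
build the operation, hand it to the lowlevel plugin, and immediately go back
to the socket without waiting for the reply:

static int
recv_request_loop (struct connection *conn)
{
  while (!conn->quit) {
    struct operation *op = calloc (1, sizeof *op);
    if (op == NULL)
      return -1;
    op->conn = conn;

    /* Read and validate the header, fill in handle/cmd/flags/count,
     * allocate op->buf and (for writes) read the payload into it. */
    if (recv_request (conn, op) == -1) {
      free (op);
      return -1;
    }

    /* Hand off to the plugin's lowlevel pread/pwrite/flush/trim/zero.
     * A return of -1 means something critically bad, as in the plugin
     * example above, so drop the connection; otherwise the reply comes
     * later via nbdkit_reply()/send_reply() from whatever thread
     * completes the I/O. */
    if (dispatch_to_plugin_lowlevel (op) == -1) {
      free (op);
      return -1;
    }

    /* Loop continues immediately; any number of previous requests may
     * still be in flight inside the plugin. */
  }
  return 0;
}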
On Mon, Feb 20, 2017 at 4:03 AM, Richard W.M. Jones <rjones@redhat.com>
wrote:
> > ----- Forwarded message -----
> >
> > Date: Sat, 18 Feb 2017 22:21:19 -0500
> > Subject: nbdkit async
> >
> > Hello,
> >
> > Hope this is the right person to contact regarding nbdkit design.
> >
> > I have a high latency massively parallel device that I am currently
> > implementing as an nbdkit plugin in c++ and have run into some design
> > limitations due to the synchronous callback interface nbdkit requires.
>
> Is the concern that each client requires a single thread, consuming
> memory (eg for stack space), but because of the high latency plugin
> these threads will be hanging around not doing very much? And/or is
> it that the client is blocked while servicing each request?
>
> > Nbdkit is currently designed to call the plugin
> > pread/pwrite/trim/flush/zero ops as synchronous calls and expects when the
> > plugin functions return that it can then send the nbd reply to the socket.
> >
> > It's parallel thread model is also not implemented as of yet
>
> I think this bit can be fixed fairly easily. One change which is
> especially easy to make is to send back the NBD_FLAG_CAN_MULTI_CONN
> flag (under control of the plugin).
>
> Anyway this doesn't solve your problem ...
>
> > but the
> > current design still mandates a worker thread per parallel op in progress
> > due to the synchronous design of the plugin calls.
>
> And the synchronous / 1 thread per client design of the server.
>
> > I would like to modify this to allow for an alternative operating mode
> > where nbdkit calls the plugin functions and expects the plugin to callback
> > to nbdkit when a request has completed rather than responding right after
> > the plugin call returns to nbdkit.
> >
> > If you are familiar with fuse low level api design, something very similar
> > to that.
> >
> > An example flow for a read request would be as follows:
> >
> > 1) nbdkit reads and validates the request from the socket
> > 2) nbdkit calls handle_request but now also passing in the nbd request
> > handle value
> > 3) nbdkit bundles the nbd request handle value, bool flush_on_update, and
> > read size into an opaque ptr to struct
> > 4) nbdkit calls my_plugin.pread passing in the usual args + the opaque ptr
>
> We can't change the existing API, so this would have to be exposed
> through new plugin entry point(s).
>
> > 5) my_plugin.pread makes an asynchronous read call with a handler set on
> > completion to call nbdkit_reply_read(conn, opaque ptr, buf) or on error
> > nbdkit_reply_error(conn, opaque_ptr, error)
> > 6) my_plugin.pread returns back to nbdkit without error after it has
> > started the async op but before it has completed
> > 7) nbdkit doesn't send a response to the conn->sockout because when the
> > async op has completed my_plugin will callback to nbdkit for it to send the
> > response
> > 8) nbdkit loop continues right away on the next request and it reads and
> > validates the next request from conn->sockin without waiting for the
> > previous request to complete
> > *) Now requires an additional mutex on the conn->sockout for writing
> > responses
> >
> > The benefit of this approach is that 2 threads (1 thread for reading
> > requests from the socket and kicking off requests to the plugin and 1
> > thread (or more) in a worker pool executing the async handler callbacks)
> > can service 100s of slow nbd requests in parallel overcoming high latency.
> >
> > With the current synchronous callback design we can provide async in our
> > plugin layer for pwrites and implement our flush to enforce it, but we
> > can't get around a single slow high latency read op blocking the entire
> > pipe.
> >
> > I'm willing to work on this in a branch and push this up as opensource but
> > first wanted to check if this functionality extension is in fact something
> > redhat would want for nbdkit and if so if there were suggestions to the
> > implementation.
>
> It depends on how much it complicates the internals of nbdkit (which
> are currently quite simple). Would need to see the patches ...
>
> You can help by splitting into simple changes which are generally
> applicable (eg. supporting NBD_FLAG_CAN_MULTI_CONN), and other changes
> which are more difficult to integrate.
>
> > Initial implementation approach was going to be similar to the
> > fuse_low_level approach and create an entirely separate header file for the
> > asynchronous plugin api because the plugin calls now need an additional
> > parameter (opaque ptr to handle for nbdkit_reply_). This header file
> > nbdkit_plugin_async.h defines the plugin struct with slightly different
> > function ptr prototypes that accepts the opaque ptr to nbd request handle
> > and some additional callback functions nbdkit_reply_error, nbdkit_reply,
> > and nbdkit_reply_read. The user of this plugin interface is required to
> > call either nbdkit_reply_error or nbdkit_reply[_read] in each of the
> > pread/pwrite/flush/trim/zero ops.
> >
> > If you got this far thank you for the long read and please let me know if
> > there is any interest.
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> virt-top is 'top' for virtual machines. Tiny program with many
> powerful monitoring features, net stats, disk stats, logging, etc.
> http://people.redhat.com/~rjones/virt-top
>