The concern is that a client is blocked while each request is processed. The
nbdkit server design requires a thread per request being processed,
regardless of the number of connections or clients. We want to run 1000s of
requests in parallel without needing a thread at the nbdkit layer per
request in flight. Our plugin layer is built around Boost.Asio, and a few
threads in a worker pool running an io_service can process 1000s of requests
in parallel. (Our plugin is a gateway of sorts and requests are sent back
out over the network. While our plugin waits for the read or write data we
don't block in a thread; we handle other requests that are ready.)
The current nbdkit server design requires a thread per request in progress
because it is built around a synchronous callback into the plugin layer, and
the main recv_request_send_reply loop holds the only copy of the request
handle that is needed to send the reply.
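For reference, the current synchronous path looks roughly like this (a
simplified sketch, not the actual connections.c code; recv_request,
handle_request and send_reply_to_client stand in for the real helpers):

static int
recv_request_send_reply (struct connection *conn)
{
  struct request request;
  uint32_t error = 0;
  char *buf = NULL;

  /* 1. Read and validate the request header (and any write payload). */
  if (recv_request (conn, &request, &buf) == -1)
    return -1;

  /* 2. Call the plugin synchronously; this thread is blocked until the
   *    pread/pwrite/flush/trim/zero callback has completely finished. */
  if (handle_request (conn, &request, buf) == -1)
    error = errno;

  /* 3. Only now can the reply go out, because request.handle lives in
   *    this stack frame and nowhere else. */
  return send_reply_to_client (conn, request.handle, buf, error);
}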
A more flexible design would split the recv_request_send_reply loop into a
recv_request loop and a send_reply function. The recv_request loop forwards
the request handle to the handle_request call. The existing nbdkit_plugin
struct would function identically: send_reply is called after plugin.pread
(plus any FUA flush) has finished.
An alternative plugin struct "nbdkit_plugin_lowlevel" could define a
different interface where an opaque ptr to the handle + connection + flags
is passed in, and the plugin is required to call nbdkit_reply (opaque *ptr,
...) to send the reply to the NBD client rather than nbdkit automatically
sending the reply after the plugin function returns.
Some pseudo-code examples of the changes I had in mind:
struct operation {
  struct connection *conn;
  uint64_t handle;
  uint32_t cmd;
  uint32_t flags;
  char *buf;
  uint32_t count;
};
// unchanged
struct nbdkit_plugin {
  ...
  int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset);
  int (*pwrite) (void *handle, const void *buf, uint32_t count,
                 uint64_t offset);
  int (*flush) (void *handle);
  int (*trim) (void *handle, uint32_t count, uint64_t offset);
  int (*zero) (void *handle, uint32_t count, uint64_t offset, int may_trim);
};
// new lowlevel api
struct nbdkit_plugin_lowlevel {
  ...
  int (*pread) (void *op, void *handle, void *buf, uint32_t count,
                uint64_t offset);
  int (*pwrite) (void *op, void *handle, void *buf, uint32_t count,
                 uint64_t offset);
  int (*flush) (void *op, void *handle);
  int (*trim) (void *op, void *handle, uint32_t count, uint64_t offset);
  int (*zero) (void *op, void *handle, uint32_t count, uint64_t offset,
               int may_trim);
};
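A plugin would fill in this struct much like the existing nbdkit_plugin one;
a minimal sketch of what registration might look like (the field list and
the NBDKIT_REGISTER_PLUGIN_LOWLEVEL macro are placeholders for whatever the
new header would actually define):

static struct nbdkit_plugin_lowlevel plugin = {
  .name     = "my_gateway",              /* hypothetical example plugin */
  .open     = my_plugin_open,
  .close    = my_plugin_close,
  .get_size = my_plugin_get_size,
  .pread    = my_plugin_pread_lowlevel,
  .pwrite   = my_plugin_pwrite_lowlevel,
  .flush    = my_plugin_flush_lowlevel,
  .trim     = my_plugin_trim_lowlevel,
  .zero     = my_plugin_zero_lowlevel,
};

NBDKIT_REGISTER_PLUGIN_LOWLEVEL (plugin)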
// Called by the lowlevel api to send a reply to the client
void
nbdkit_reply (void *vop)
{
  int r;
  bool flush_after_command;
  struct operation *op = (struct operation *) vop;

  flush_after_command = (op->flags & NBD_CMD_FLAG_FUA) != 0;
  if (!op->conn->can_flush || op->conn->readonly)
    flush_after_command = false;

  if (flush_after_command) {
    op->flags = 0;   /* clear FUA so the flush completion sends the reply */
    r = plugin_flush_lowlevel (op, op->conn->handle);
    if (r == -1) {
      /* do error stuff, e.g. reply with the flush error */
    }
  }
  else if (op->cmd == NBD_CMD_READ)
    send_reply (op, op->count, op->buf, 0);
  else
    send_reply (op, 0, NULL, 0);
}
int
my_plugin_pwrite_lowlevel (void *op, void *handle, void *buf,
                           uint32_t count, uint64_t offset)
{
  if (is_readonly (handle)) {
    nbdkit_reply_error (op, EROFS);
    return 0;
  }
  if (some critically bad issue)
    return -1;

  // this returns right away, before the write has completed, and when it
  // does complete the handler lambda is called
  my_storage.async_write (buf, count, offset,
    [op, count](const boost::system::error_code &ec) {
      if (ec)
        nbdkit_reply_error (op, ec.value());
      else
        nbdkit_reply (op);
    });
  return 0;
}
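The read path is symmetric, and nothing about it requires C++; a sketch in
plain C against a made-up callback-style async API (my_storage_async_read
and my_read_done are hypothetical) would be:

// Completion callback run by the (hypothetical) async storage layer once
// the data for the read has been copied into the buffer nbdkit handed us.
static void
my_read_done (void *vop, int error)
{
  if (error)
    nbdkit_reply_error (vop, error);
  else
    nbdkit_reply (vop);  /* op->cmd == NBD_CMD_READ, so this sends op->buf */
}

int
my_plugin_pread_lowlevel (void *op, void *handle, void *buf,
                          uint32_t count, uint64_t offset)
{
  // Returns immediately; my_read_done fires later on a worker thread.
  my_storage_async_read (handle, buf, count, offset, my_read_done, op);
  return 0;
}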
// connections.c
static int
_send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)
{
  int r;
  struct reply reply;

  reply.magic = htobe32 (NBD_REPLY_MAGIC);
  reply.handle = op->handle;
  reply.error = htobe32 (nbd_errno (error));

  if (error != 0) {
    /* Since we're about to send only the limited NBD_E* errno to the
     * client, don't lose the information about what really happened
     * on the server side.  Make sure there is a way for the operator
     * to retrieve the real error.
     */
    debug ("sending error reply: %s", strerror (error));
  }

  r = xwrite (op->conn->sockout, &reply, sizeof reply);
  if (r == -1) {
    nbdkit_error ("write reply: %m");
    return -1;
  }

  if (op->cmd == NBD_CMD_READ) { /* Send the read data buffer. */
    r = xwrite (op->conn->sockout, buf, count);
    if (r == -1) {
      nbdkit_error ("write data: %m");
      return -1;
    }
  }

  return 0;
}
// new mutex on writes due to the parallel nature of responding to the socket
int
send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)
{
  int r;

  plugin_lock_reply (op->conn);
  r = _send_reply (op, count, buf, error);
  plugin_unlock_reply (op->conn);
  free (op);
  return r;
}
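One piece the pseudo-code above doesn't show is the receive side. Roughly (a
sketch only; recv_request and dispatch_to_plugin_lowlevel are illustrative
names, not a final API), the recv_request loop would validate each request,
build the operation, hand it to the lowlevel plugin, and immediately go back
to the socket without waiting for the reply:

static int
recv_request_loop (struct connection *conn)
{
  while (!conn->quit) {
    struct operation *op = calloc (1, sizeof *op);
    if (op == NULL)
      return -1;
    op->conn = conn;

    /* Read and validate the header, fill in handle/cmd/flags/count,
     * allocate op->buf and (for writes) read the payload into it. */
    if (recv_request (conn, op) == -1) {
      free (op);
      return -1;
    }

    /* Hand off to the plugin's lowlevel pread/pwrite/flush/trim/zero.
     * A return of -1 means something critically bad, as in the plugin
     * example above, so drop the connection; otherwise the reply comes
     * later via nbdkit_reply()/send_reply() from whatever thread
     * completes the I/O. */
    if (dispatch_to_plugin_lowlevel (op) == -1) {
      free (op);
      return -1;
    }

    /* Loop continues immediately; any number of previous requests may
     * still be in flight inside the plugin. */
  }
  return 0;
}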
On Mon, Feb 20, 2017 at 4:03 AM, Richard W.M. Jones <rjones@redhat.com>
wrote:
> > ----- Forwarded message -----
> >
> > Date: Sat, 18 Feb 2017 22:21:19 -0500
> > Subject: nbdkit async
> >
> > Hello,
> >
> > Hope this is the right person to contact regarding nbdkit design.
> >
> > I have a high latency massively parallel device that I am currently
> > implementing as an nbdkit plugin in c++ and have run into some design
> > limitations due to the synchronous callback interface nbdkit requires.
>
> Is the concern that each client requires a single thread, consuming
> memory (eg for stack space), but because of the high latency plugin
> these threads will be hanging around not doing very much? And/or is
> it that the client is blocked while servicing each request?
>
> > Nbdkit is currently designed to call the plugin
> > pread/pwrite/trim/flush/zero ops as synchronous calls and expects when the
> > plugin functions return that it can then send the nbd reply to the socket.
> >
> > It's parallel thread model is also not implemented as of yet
>
> I think this bit can be fixed fairly easily. One change which is
> especially easy to make is to send back the NBD_FLAG_CAN_MULTI_CONN
> flag (under control of the plugin).
>
> Anyway this doesn't solve your problem ...
>
> > but the
> > current design still mandates a worker thread per parallel op in progress
> > due to the synchronous design of the plugin calls.
>
> And the synchronous / 1 thread per client design of the server.
>
> > I would like to modify this to allow for an alternative operating mode
> > where nbdkit calls the plugin functions and expects the plugin to callback
> > to nbdkit when a request has completed rather than responding right after
> > the plugin call returns to nbdkit.
> >
> > If you are familiar with fuse low level api design, something very similar
> > to that.
> >
> > An example flow for a read request would be as follows:
> >
> > 1) nbdkit reads and validates the request from the socket
> > 2) nbdkit calls handle_request but now also passing in the nbd request
> > handle value
> > 3) nbdkit bundles the nbd request handle value, bool flush_on_update, and
> > read size into an opaque ptr to struct
> > 4) nbdkit calls my_plugin.pread passing in the usual args + the opaque ptr
>
> We can't change the existing API, so this would have to be exposed
> through new plugin entry point(s).
>
> > 5) my_plugin.pread makes an asynchronous read call with a handler set on
> > completion to call nbdkit_reply_read(conn, opaque ptr, buf) or on error
> > nbdkit_reply_error(conn, opaque_ptr, error)
> > 6) my_plugin.pread returns back to nbdkit without error after it has
> > started the async op but before it has completed
> > 7) nbdkit doesn't send a response to the conn->sockout because when the
> > async op has completed my_plugin will callback to nbdkit for it to send the
> > response
> > 8) nbdkit loop continues right away on the next request and it reads and
> > validates the next request from conn->sockin without waiting for the
> > previous request to complete
> > *) Now requires an additional mutex on the conn->sockout for writing
> > responses
> >
> > The benefit of this approach is that 2 threads (1 thread for reading
> > requests from the socket and kicking off requests to the plugin and 1
> > thread (or more) in a worker pool executing the async handler callbacks)
> > can service 100s of slow nbd requests in parallel overcoming high latency.
> >
> > With the current synchronous callback design we can provide async in our
> > plugin layer for pwrites and implement our flush to enforce it, but we
> > can't get around a single slow high latency read op blocking the entire
> > pipe.
> >
> > I'm willing to work on this in a branch and push this up as opensource but
> > first wanted to check if this functionality extension is in fact something
> > redhat would want for nbdkit and if so if there were suggestions to the
> > implementation.
>
> It depends on how much it complicates the internals of nbdkit (which
> are currently quite simple). Would need to see the patches ...
>
> You can help by splitting into simple changes which are generally
> applicable (eg. supporting NBD_FLAG_CAN_MULTI_CONN), and other changes
> which are more difficult to integrate.
>
> > Initial implementation approach was going to be similar to the
> > fuse_low_level approach and create an entirely separate header file for the
> > asynchronous plugin api because the plugin calls now need an additional
> > parameter (opaque ptr to handle for nbdkit_reply_). This header file
> > nbdkit_plugin_async.h defines the plugin struct with slightly different
> > function ptr prototypes that accepts the opaque ptr to nbd request handle
> > and some additional callback functions nbdkit_reply_error, nbdkit_reply,
> > and nbdkit_reply_read. The user of this plugin interface is required to
> > call either nbdkit_reply_error or nbdkit_reply[_read] in each of the
> > pread/pwrite/flush/trim/zero ops.
> >
> > If you got this far thank you for the long read and please let me know if
> > there is any interest.
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> virt-top is 'top' for virtual machines. Tiny program with many
> powerful monitoring features, net stats, disk stats, logging, etc.
> http://people.redhat.com/~rjones/virt-top
>