Richard W.M. Jones
2022-Aug-30 09:07 UTC
[Libguestfs] [PATCH libnbd] ublk: Add new nbdublk program
On Tue, Aug 30, 2022 at 04:30:40PM +0800, Ming Lei wrote:
> On Tue, Aug 30, 2022 at 09:04:07AM +0100, Richard W.M. Jones wrote:
> > On Tue, Aug 30, 2022 at 10:32:02AM +0800, Ming Lei wrote:
> > > Hi Jones,
> > >
> > > On Thu, Aug 25, 2022 at 01:10:55PM +0100, Richard W.M. Jones wrote:
> > > > This patch adds simple support for a ublk-based NBD client.
> > > > It is also available here:
> > > > https://gitlab.com/rwmjones/libnbd/-/tree/nbdublk/ublk
> > > >
> > > > ublk is a way to write Linux block device drivers in userspace:
> > >
> > > Just looked at your nbdublk implementation a bit; basically it is good,
> > > and really nice work.
> > >
> > > Two suggestions follow:
> > >
> > > 1) The io_uring context is multiplexed with ublk io command handling, so
> > > we should avoid blocking in both ->handle_io_async() and
> > > ->handle_event(), otherwise performance may be bad.
> >
> > The nbd_aio_* calls don't block.
> >
> > However I noticed that I made a mistake with the trim and zero paths
> > because I am using synchronous (blocking) nbd_flush / nbd_trim /
> > nbd_zero instead of nbd_aio_flush / nbd_aio_trim / nbd_aio_zero.  I
> > will fix this soon.
> >
> > Nothing in handle_event should block except for the call to
> > pthread_mutex_lock.  This lock is necessary because new commands can
> > be retired on the nbd_work_thread while handle_event is being called
> > from the io_uring thread.
>
> Yes, I meant the mutex.
>
> Also, given that nbd_work_thread() is required, it is reasonable to move
> nbd_aio_* into nbd_work_thread(), to avoid waiting on the mutex in
> io_uring context; otherwise io commands may not be forwarded to userspace
> in time.

nbd_aio_* calls shouldn't block.  But how can we move those calls to
nbd_work_thread()?  If we queue up commands in handle_io_async we
would still have to use a lock on that queue.  I don't see how we can
avoid some kind of locking in handle_io_async.

> BTW what is the context for running the callback of nbd_aio_*()?

It depends on whether nbd_aio_* can complete the command entirely
without blocking.

The most common case is that nbd_aio_* calls a network function
(eg. recv(2)) which returns EWOULDBLOCK.  Later, on nbd_work_thread,
recv(2) is completed, the command is processed, and
command_completed() is called.

It's unlikely, but possible, that the command_completed() function
could be called directly from handle_io_async -> nbd_aio_* ->
command_completed, eg. if recv(2) never blocks for some reason, like
it's talking over a Unix domain socket to a server which is very quick
to reply.
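
As an illustration of the callback context described above: each libnbd
nbd_aio_* call takes a completion callback which runs when the command is
retired.  The sketch below is only illustrative; it uses the real libnbd
API, but the handler shape and the names handle_io_async /
command_completed stand in for the functions in the nbdublk patch and do
not reproduce its actual signatures.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <libnbd.h>

/* Completion callback.  Usually this runs later, from the thread that
 * drives the handle (the nbd work thread); it runs immediately only if
 * the command completes without blocking, as described above. */
static int
command_completed (void *user_data, int *error)
{
  /* Tell ublk that the request identified by user_data is done ... */
  return 1;     /* returning 1 lets libnbd retire the command automatically */
}

/* Illustrative submit path for a read request. */
static void
handle_io_async (struct nbd_handle *h, void *buf,
                 size_t count, uint64_t offset, void *req)
{
  nbd_completion_callback cb = { .callback = command_completed,
                                 .user_data = req };

  if (nbd_aio_pread (h, buf, count, offset, cb, 0) == -1) {
    fprintf (stderr, "%s\n", nbd_get_error ());
    exit (EXIT_FAILURE);
  }
}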
> > > 2) In the implementation of the nbd worker thread, there are two sleep
> > > points (waiting for an incoming io command, and the network FD); I'd
> > > suggest using poll to wait on either of them.
> > >
> > > Recently I have been working on adding ublksrv io offloading or aio
> > > interfaces for this sort of case in which io_uring can't be used,
> > > which may simplify this area; please see the attached patch which
> > > applies the above two points against your patch.  An obvious
> > > improvement can be observed in my simple fio test (randread, io, 4k
> > > bs, libaio) against a backend of 'nbdkit file'.
> > >
> > > But these interfaces aren't merged to the ublksrv github tree yet; you
> > > can find them in the aio branch, and demo_event.c is one example wrt.
> > > how to use them:
> > >
> > > https://github.com/ming1/ubdsrv/tree/aio
> > >
> > > Actually this interface can be improved further for the nbdublk case,
> > > and the request allocation actually isn't needed for this direct
> > > offloading.  But they are added to cover some IOs which don't come from
> > > the ublk driver, such as metadata, so 'struct ublksrv_aio' is allocated.
> > > I will try my best to finalize them and merge them to the master branch.
> >
> > I didn't really understand what these patches to ubdsrv do when I
> > looked at them before.  Maybe add some diagrams?
>
> The patches were added recently.  They just simplify offloading IO
> handling to another worker context, such as nbd_work_thread.
>
> They use one eventfd to notify the target worker thread when new requests
> are added to a list.  Once the worker thread is woken up, it fetches
> requests from the list; after handling these requests, the worker thread
> waits on both the eventfd and the target IO (network FD) via poll().

In the patch, it seems there is no place which creates a pthread for
nbd_work_thread.  Also the ubdsrv aio branch does not call pthread_create.
So I don't understand in what context nbd_work_thread is called.

I've never used eventfd before and the man page for it is very opaque.

> In your previous implementation, the nbd work thread may wait on one
> pthread mutex and aio_poll(); this may not be efficient, given that
> while waiting on one event, other events can't be handled.

I'm not sure what "event" means in this context.  Does it mean
"NBD command"?  Or a poll(2) event?

There are multiple (usually 4) nbd_work threads, one for each NBD
network connection.  Each NBD network connection can handle many
commands in flight at once.

However if your patch is faster, then something must be going on.

> > > BTW, IOPS on nbdublk (backend: nbdkit file) still has a big gap compared
> > > with ublk-loop, so I guess in future maybe io_uring should be tried to
> > > see if a big improvement can be observed.
> >
> > It's always going to be a bit slower because we're converting the
> > requests into a network protocol and passing them to another process.
>
> In my simple fio randread test, it can be 5+ ~ 10+ slower compared with loop.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
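
A brief aside on eventfd, which comes up above and in the reply below: an
eventfd is a kernel-maintained 64-bit counter behind a file descriptor.  A
write adds to the counter and makes the fd readable, so it can be used to
wake a thread sleeping in poll(2).  A minimal standalone sketch, not taken
from the ublksrv code:

#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/eventfd.h>

int
main (void)
{
  int efd = eventfd (0, EFD_CLOEXEC);   /* counter starts at 0 */
  if (efd == -1) { perror ("eventfd"); exit (EXIT_FAILURE); }

  uint64_t one = 1;
  write (efd, &one, sizeof one);        /* "kick": counter += 1, fd becomes readable */

  struct pollfd pfd = { .fd = efd, .events = POLLIN };
  poll (&pfd, 1, -1);                   /* returns immediately because of the kick */

  uint64_t n;
  read (efd, &n, sizeof n);             /* drain: returns the accumulated count, resets to 0 */
  printf ("woken, %llu kick(s)\n", (unsigned long long) n);

  close (efd);
  return 0;
}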
On Tue, Aug 30, 2022 at 10:07:40AM +0100, Richard W.M. Jones wrote:
> On Tue, Aug 30, 2022 at 04:30:40PM +0800, Ming Lei wrote:
> > On Tue, Aug 30, 2022 at 09:04:07AM +0100, Richard W.M. Jones wrote:
> > > On Tue, Aug 30, 2022 at 10:32:02AM +0800, Ming Lei wrote:
> > > > Hi Jones,
> > > >
> > > > On Thu, Aug 25, 2022 at 01:10:55PM +0100, Richard W.M. Jones wrote:
> > > > > This patch adds simple support for a ublk-based NBD client.
> > > > > It is also available here:
> > > > > https://gitlab.com/rwmjones/libnbd/-/tree/nbdublk/ublk
> > > > >
> > > > > ublk is a way to write Linux block device drivers in userspace:
> > > >
> > > > Just looked at your nbdublk implementation a bit; basically it is good,
> > > > and really nice work.
> > > >
> > > > Two suggestions follow:
> > > >
> > > > 1) The io_uring context is multiplexed with ublk io command handling,
> > > > so we should avoid blocking in both ->handle_io_async() and
> > > > ->handle_event(), otherwise performance may be bad.
> > >
> > > The nbd_aio_* calls don't block.
> > >
> > > However I noticed that I made a mistake with the trim and zero paths
> > > because I am using synchronous (blocking) nbd_flush / nbd_trim /
> > > nbd_zero instead of nbd_aio_flush / nbd_aio_trim / nbd_aio_zero.  I
> > > will fix this soon.
> > >
> > > Nothing in handle_event should block except for the call to
> > > pthread_mutex_lock.  This lock is necessary because new commands can
> > > be retired on the nbd_work_thread while handle_event is being called
> > > from the io_uring thread.
> >
> > Yes, I meant the mutex.
> >
> > Also, given that nbd_work_thread() is required, it is reasonable to move
> > nbd_aio_* into nbd_work_thread(), to avoid waiting on the mutex in
> > io_uring context; otherwise io commands may not be forwarded to userspace
> > in time.
>
> nbd_aio_* calls shouldn't block.  But how can we move those calls to
> nbd_work_thread()?  If we queue up commands in handle_io_async we

Please see aio_submitter() in my patch; the call has already been moved
to the nbd work thread.

> would still have to use a lock on that queue.  I don't see how we can
> avoid some kind of locking in handle_io_async.

pthread_spin_lock, which protects the list, isn't supposed to sleep.

Going completely lockless is possible too, by using a per-queue bitmap to
mark which requests are submitted & completed, and one per-queue array to
store the results.  But per my test, nbd performance is still a bit low,
so I'm not sure this kind of optimization makes a difference.
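
A rough sketch of the offloading pattern being described here:
handle_io_async only appends the request to a list under a spinlock and
kicks an eventfd, and the actual nbd_aio_* submission happens later on the
nbd work thread.  The structure and field names below are illustrative and
are not the actual ublksrv_aio API.

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical per-queue submit list, protected by a spinlock. */
struct io_request {
  struct io_request *next;
  /* ... tag, op, buffer, offset, length ... */
};

struct submit_queue {
  pthread_spinlock_t lock;
  struct io_request *head;      /* singly-linked list of pending requests */
  int efd;                      /* eventfd used to wake the nbd work thread */
};

/* Called from the io_uring/ublk thread: never blocks on the network.
 * It only queues the request and wakes the worker. */
static void
handle_io_async (struct submit_queue *q, struct io_request *req)
{
  uint64_t one = 1;

  pthread_spin_lock (&q->lock);
  req->next = q->head;
  q->head = req;
  pthread_spin_unlock (&q->lock);

  write (q->efd, &one, sizeof one);   /* kick the worker out of poll() */
}

/* Called from the nbd work thread once poll() reports the eventfd is
 * readable: detach the whole list in one short critical section, then
 * submit each request with nbd_aio_* outside the lock. */
static struct io_request *
drain_queue (struct submit_queue *q)
{
  struct io_request *list;
  uint64_t n;

  read (q->efd, &n, sizeof n);        /* reset the eventfd counter */

  pthread_spin_lock (&q->lock);
  list = q->head;
  q->head = NULL;
  pthread_spin_unlock (&q->lock);

  return list;
}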
> > BTW what is the context for running the callback of nbd_aio_*()?
>
> It depends on whether nbd_aio_* can complete the command entirely
> without blocking.
>
> The most common case is that nbd_aio_* calls a network function
> (eg. recv(2)) which returns EWOULDBLOCK.  Later, on nbd_work_thread,
> recv(2) is completed, the command is processed, and
> command_completed() is called.
>
> It's unlikely, but possible, that the command_completed() function
> could be called directly from handle_io_async -> nbd_aio_* ->
> command_completed, eg. if recv(2) never blocks for some reason, like
> it's talking over a Unix domain socket to a server which is very quick
> to reply.

OK, got it.  I just tried to understand the whole flow.

> > > > 2) In the implementation of the nbd worker thread, there are two sleep
> > > > points (waiting for an incoming io command, and the network FD); I'd
> > > > suggest using poll to wait on either of them.
> > > >
> > > > Recently I have been working on adding ublksrv io offloading or aio
> > > > interfaces for this sort of case in which io_uring can't be used,
> > > > which may simplify this area; please see the attached patch which
> > > > applies the above two points against your patch.  An obvious
> > > > improvement can be observed in my simple fio test (randread, io, 4k
> > > > bs, libaio) against a backend of 'nbdkit file'.
> > > >
> > > > But these interfaces aren't merged to the ublksrv github tree yet; you
> > > > can find them in the aio branch, and demo_event.c is one example wrt.
> > > > how to use them:
> > > >
> > > > https://github.com/ming1/ubdsrv/tree/aio
> > > >
> > > > Actually this interface can be improved further for the nbdublk case,
> > > > and the request allocation actually isn't needed for this direct
> > > > offloading.  But they are added to cover some IOs which don't come from
> > > > the ublk driver, such as metadata, so 'struct ublksrv_aio' is allocated.
> > > > I will try my best to finalize them and merge them to the master branch.
> > >
> > > I didn't really understand what these patches to ubdsrv do when I
> > > looked at them before.  Maybe add some diagrams?
> >
> > The patches were added recently.  They just simplify offloading IO
> > handling to another worker context, such as nbd_work_thread.
> >
> > They use one eventfd to notify the target worker thread when new requests
> > are added to a list.  Once the worker thread is woken up, it fetches
> > requests from the list; after handling these requests, the worker thread
> > waits on both the eventfd and the target IO (network FD) via poll().
>
> In the patch, it seems there is no place which creates a pthread for
> nbd_work_thread.  Also the ubdsrv aio branch does not call pthread_create.
> So I don't understand in what context nbd_work_thread is called.

The nbd work thread is created by the nbd target code just like before,
but the body of the thread is changed to the following, basically bound
to one aio_ctx:

    while (!ublksrv_aio_ctx_dead(aio_ctx)) {
            struct aio_list compl;

            aio_list_init(&compl);

            //retrieve requests from the submit list, and submit each one via
            //aio_submitter(); if any are done, add them to &compl
            ublksrv_aio_submit_worker(aio_ctx, aio_submitter, &compl);

            //add requests completed from command_completed() to &compl
            pthread_spin_lock(&c->lock);
            aio_list_splice(&c->list, &compl);
            pthread_spin_unlock(&c->lock);

            //notify the io_uring thread about the completed requests in
            //&compl, so batched completion & re-issue can be done in
            //io_uring context
            ublksrv_aio_complete_worker(aio_ctx, &compl);

            //wait for network IO and the eventfd from io_uring at the same
            //time, so if either one is ready, nbd_poll2() will return from
            //sleep
            if (nbd_poll2 (h, aio_ctx->efd, -1) == -1) {
                    fprintf (stderr, "%s\n", nbd_get_error ());
                    exit (EXIT_FAILURE);
            }
    }
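
For readers who have not combined libnbd with an extra file descriptor
before: nbd_poll2 in the loop above waits on the NBD socket and one extra
fd at the same time.  Roughly the same thing can be written by hand with
poll(2) and libnbd's lower-level aio calls; a minimal sketch, with error
handling omitted, assuming h is a connected handle and efd is the eventfd
used for kicks:

#include <poll.h>
#include <libnbd.h>

static void
wait_nbd_or_kick (struct nbd_handle *h, int efd)
{
  struct pollfd fds[2];
  unsigned dir = nbd_aio_get_direction (h);

  fds[0].fd = nbd_aio_get_fd (h);       /* the NBD socket */
  fds[0].events = 0;
  if (dir & LIBNBD_AIO_DIRECTION_READ)  fds[0].events |= POLLIN;
  if (dir & LIBNBD_AIO_DIRECTION_WRITE) fds[0].events |= POLLOUT;

  fds[1].fd = efd;                      /* the eventfd written by handle_io_async */
  fds[1].events = POLLIN;

  poll (fds, 2, -1);

  /* Let libnbd make progress on the socket ... */
  if (fds[0].revents & POLLIN)       nbd_aio_notify_read (h);
  else if (fds[0].revents & POLLOUT) nbd_aio_notify_write (h);

  /* ... and if fds[1].revents & POLLIN, drain the eventfd and submit the
   * newly queued requests with nbd_aio_* calls. */
}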
>
> I've never used eventfd before and the man page for it is very opaque.
>
> > In your previous implementation, the nbd work thread may wait on one
> > pthread mutex and aio_poll(); this may not be efficient, given that
> > while waiting on one event, other events can't be handled.
>
> I'm not sure what "event" means in this context.  Does it mean
> "NBD command"?  Or a poll(2) event?

Here "event" is generic; I meant: NBD IO ready (exactly what aio_poll()
waits for) or io_uring eventfd ready (one write done from
nbd_handle_io_async()).

>
> There are multiple (usually 4) nbd_work threads, one for each NBD
> network connection.  Each NBD network connection can handle many
> commands in flight at once.

OK, but aio_poll() is supposed to get notified when one command is done,
so this is just an implementation detail; correct me if I am wrong.

Thanks,
Ming
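
As a closing illustration of Ming's earlier remark about going completely
lockless with a per-queue bitmap plus a per-queue result array: one way
such a scheme could look, using C11 atomics.  This is only an
interpretation of the idea, not code from ublksrv or the nbdublk patch;
the names and the queue depth are made up.

#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_DEPTH 128          /* hypothetical ublk queue depth */

/* Per-queue completion state shared between the nbd work thread
 * (producer of completions) and the io_uring thread (consumer).
 * No lock: each slot is owned by exactly one in-flight tag at a time. */
struct queue_state {
  int results[QUEUE_DEPTH];                       /* result per tag */
  _Atomic uint64_t completed[QUEUE_DEPTH / 64];   /* bitmap of completed tags */
};

/* nbd work thread: store the result first, then publish the bit.
 * Release ordering makes the result visible before the bit. */
static void
mark_completed (struct queue_state *q, unsigned tag, int res)
{
  q->results[tag] = res;
  atomic_fetch_or_explicit (&q->completed[tag / 64],
                            UINT64_C(1) << (tag % 64),
                            memory_order_release);
  /* ... then kick the io_uring thread's eventfd as before. */
}

/* io_uring thread: swap each bitmap word to zero and complete the
 * requests whose bits were set. */
static void
reap_completed (struct queue_state *q)
{
  for (unsigned w = 0; w < QUEUE_DEPTH / 64; w++) {
    uint64_t bits = atomic_exchange_explicit (&q->completed[w], 0,
                                              memory_order_acquire);
    while (bits) {
      unsigned tag = w * 64 + (unsigned) __builtin_ctzll (bits);
      bits &= bits - 1;
      /* complete ublk request 'tag' with q->results[tag] ... */
      (void) q->results[tag];
    }
  }
}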