Richard W.M. Jones
2019-May-31 16:01 UTC
[Libguestfs] [libnbd] Simultaneous read and write
This is a continuation of a discussion we were having on IRC.  The
problems with IRC are it's not recorded and it's hard to have deep
technical conversations.  I hope this is a decent summary.

Problem simply stated: Certain NBD servers (qemu-nbd in particular)
are able to simultaneously read and write on a socket.  ie. They can
be simultaneously reading a request and writing the reply to a
previous request.  However libnbd is unable to do this trick.
Although multiple requests can be in flight, libnbd is either writing
on the socket or reading from the socket, but never both.

Visualized it looks like this:

     write          write       write
  |===========|______________|=======|_______|===|   --> rq to server
  _____________|============|_________|=====|_____   <-- rp from server
                read           read

whereas an ideal libnbd which could write and read simultaneously,
coupled with a server which can do the same:

     write
  |===========||=======||===|     --> requests to server
  |============||=====||=====     <-- replies from server
      read

Eric already observed through testing that the ideal client can be up
to twice as fast as libnbd, as is obvious from the diagram.

The reasons why libnbd can't do this are:

Problem (a) : Only one thread of control can hold the libnbd handle
lock (h->lock), and since there can only be one thread of control
running in each handle at any time, that thread can only be reading
or writing.

Problem (b) : There is only one state machine per handle (h->state),
whereas to handle the write and read sides separately requires two
state machines.  In the IRC discussion we gave these the preliminary
names h->wstate and h->rstate.

----------------------------------------------------------------------

It's worth also saying how the current API works, although we might
want to change it.

You grab the underlying file descriptor using nbd_aio_get_fd, which
is what you poll on.  You also have to call nbd_aio_get_direction
which returns READ, WRITE or BOTH (== READ|WRITE).  You then set up
some events mechanism (eg. poll, epoll, etc.), poll until the file
descriptor is ready, and call one of nbd_aio_notify_read or
nbd_aio_notify_write.

The direction can change any time the handle state changes, which
includes whenever you issue a command (eg. nbd_aio_pread), or
whenever you call nbd_aio_notify_*.  You therefore have to call
nbd_aio_get_direction frequently.

A typical loop using poll might look like:

  fd = nbd_aio_get_fd (nbd);
  for (;;) {
    /* <-- If you need to issue more commands, do that here. */
    dir = nbd_aio_get_direction (nbd);
    pollfd[0].fd = fd;
    pollfd[0].events = 0;
    if (dir & LIBNBD_AIO_DIRECTION_READ) pollfd[0].events |= POLLIN;
    if (dir & LIBNBD_AIO_DIRECTION_WRITE) pollfd[0].events |= POLLOUT;
    poll (pollfd, 1, -1);
    if (pollfd[0].revents & LIBNBD_AIO_DIRECTION_READ)
      nbd_aio_notify_read ();
    else if (pollfd[0].revents & LIBNBD_AIO_DIRECTION_WRITE)
      nbd_aio_notify_write ();
  }

----------------------------------------------------------------------

The above code is of course assuming a single thread.  But to
simultaneously read and write we will need at least two threads.
It's hard for me to see a nice way to evolve the API to support
multiple threads, but I guess issues include:

- If one thread is waiting in poll(2) and another thread issues a
  command, how do we get the first thread to return from poll and
  reevaluate direction?  (Eric suggests a pipe-to-self for this)

- If two threads are waiting in poll(2) and both are notified that
  the fd is ready, how do we get one of them to read and the other
  to write?  I think this implies that we don't have two threads
  doing poll(2), but if not how do we farm out the read/write work
  to two or more threads?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
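[A generic sketch of the pipe-to-self wakeup mentioned in the first
bullet above.  This is illustrative only and not part of the libnbd
API: one thread blocks in poll(2) on the socket plus the read end of
a pipe, and any thread that issues a command writes one byte to the
pipe, forcing poll to return so the loop can re-check
nbd_aio_get_direction.

  #include <poll.h>
  #include <unistd.h>

  static int selfpipe[2];            /* created once with pipe (selfpipe) */

  /* Called from a command-issuing thread, e.g. after nbd_aio_pread. */
  static void
  wake_poller (void)
  {
    char c = 0;
    write (selfpipe[1], &c, 1);
  }

  /* In the event loop: poll both fds, drain the pipe when it fires. */
  static void
  wait_for_events (int sockfd, short sockevents)
  {
    struct pollfd fds[2] = {
      { .fd = sockfd,      .events = sockevents },
      { .fd = selfpipe[0], .events = POLLIN },
    };

    poll (fds, 2, -1);
    if (fds[1].revents & POLLIN) {
      char c;
      read (selfpipe[0], &c, 1);     /* just drain; the wakeup was the point */
    }
    /* the caller then re-reads the direction and calls nbd_aio_notify_*() */
  }
]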
On 5/31/19 11:01 AM, Richard W.M. Jones wrote:
> This is a continuation of a discussion we were having on IRC.  The
> problems with IRC are it's not recorded and it's hard to have deep
> technical conversations.  I hope this is a decent summary.
>
> Problem simply stated: Certain NBD servers (qemu-nbd in particular)

Also nbdkit for a plugin with NBDKIT_THREAD_MODEL_PARALLEL.  qemu-nbd
does it via coroutines and nbdkit via threads, but the concept is
similar: a request-reply cycle is handled within one thread, but that
thread relinquishes read control to any other thread prior to
starting to write the reply, so the next thread in line can be
reading in parallel.

> are able to simultaneously read and write on a socket.  ie. They can
> be simultaneously reading a request and writing the reply to a
> previous request.  However libnbd is unable to do this trick.
> Although multiple requests can be in flight, libnbd is either writing
> on the socket or reading from the socket, but never both.
>
> Visualized it looks like this:
>
>      write          write       write
>   |===========|______________|=======|_______|===|   --> rq to server
>   _____________|============|_________|=====|_____   <-- rp from server
>                 read           read
>
> whereas an ideal libnbd which could write and read simultaneously,
> coupled with a server which can do the same:
>
>      write
>   |===========||=======||===|     --> requests to server
>   |============||=====||=====     <-- replies from server
>       read
>
> Eric already observed through testing that the ideal client can be up
> to twice as fast as libnbd, as is obvious from the diagram.

More concretely, the existing nbdkit-nbd plugin uses multiple threads
(as provided by nbdkit) to write requests to the server (using a
mutex so that requests are not interleaved), then a separate reader
thread to read responses from the server (with semaphores to learn
which writer thread to wake up when the response is complete).  The
ideal division of labor that would permit reads to be in a separate
thread from writes would require that the nbd_aio_pread() call (in
the nbdkit thread) run until the entire request is out the door, even
if it temporarily blocks on write() and has to wait for POLLOUT.
Meanwhile, the reader thread would spin waiting for just POLLIN
events, rather than having to service any of the POLLOUT of the
writer thread(s) running in parallel.

>
> The reasons why libnbd can't do this are:
>
> Problem (a) : Only one thread of control can hold the libnbd handle
> lock (h->lock), and since there can only be one thread of control
> running in each handle at any time, that thread can only be reading or
> writing.

Indeed, when the writer thread calls nbd_aio_pread(), it is
contending on the same lock as the nbd_aio_notify_read() in the
reader thread, so we'd have to split it up into several finer-grained
locks (maybe keep all locking APIs with a grab on h->lock at the
beginning, but while holding that lock we then grab the h->rstate or
h->wstate lock for the rest of the call, and drop the main h->lock at
that point even though the API doesn't return until the state machine
blocks again).
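[A minimal sketch of the lock split described above.  The struct
fields and the function are hypothetical stand-ins, not the real
libnbd internals; the point is only the ordering: take h->lock
briefly, take the per-side lock, then drop h->lock so the other side
can make progress.

  #include <pthread.h>

  struct nbd_handle {
    pthread_mutex_t lock;    /* handle-wide lock (held only briefly) */
    pthread_mutex_t wlock;   /* protects the write-side state machine (h->wstate) */
    pthread_mutex_t rlock;   /* protects the read-side state machine (h->rstate) */
    /* ... */
  };

  /* Hypothetical shape of a write-side API call under the split-lock scheme. */
  static int
  aio_pwrite_locked (struct nbd_handle *h)
  {
    pthread_mutex_lock (&h->lock);    /* validate the handle, queue the command */
    pthread_mutex_lock (&h->wlock);   /* claim the write-side state machine ... */
    pthread_mutex_unlock (&h->lock);  /* ... then drop the main lock so that
                                         nbd_aio_notify_read() in the reader
                                         thread only contends on h->rlock */
    /* run the write-side state machine until it blocks on the socket */
    pthread_mutex_unlock (&h->wlock);
    return 0;
  }
]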
With my initial use of libnbd, the division of labor for which thread
writes a packet falls into three classes:

- if the state machine is ready, the libnbd fd is polling only on
  POLLIN; from there:

  - if the request is small, the entire request is written by the
    nbdkit thread during nbd_aio_pwrite (in this case, our write to
    the pipe-to-self is actually wasted work, because the reader loop
    thread will spuriously wake up, check the directions, and see
    that it still only needs POLLIN for the next server response)

  - if the request is large, the first half of the request is written
    from the nbdkit thread during nbd_aio_pwrite, and the second half
    of the request is written from the reader thread during
    nbd_aio_notify_write (here, our write to the pipe-to-self is
    essential, because we HAD to break the reader loop out of its
    POLLIN poll() in order to learn that it now needs to poll for
    POLLIN|POLLOUT)

- if the state machine is not ready, the libnbd fd is either in the
  middle of processing a reply (POLLIN) or a request (POLLIN|POLLOUT,
  since we allow replies to interrupt the processing of a request).
  nbd_aio_pwrite queues the new request and returns immediately,
  without write()ing anything from the nbdkit thread, and the entire
  request gets written by the reader thread during subsequent
  nbd_aio_notify_{read,write} (in this case, our write to the
  pipe-to-self is useful if libnbd was in the middle of processing a
  reply, but wasted work if it was in the middle of processing a
  previous request)

>
> Problem (b) : There is only one state machine per handle (h->state),
> whereas to handle the write and read sides separately requires two
> state machines.  In the IRC discussion we gave these the preliminary
> names h->wstate and h->rstate.
>
> ----------------------------------------------------------------------
>
> It's worth also saying how the current API works, although we might
> want to change it.
>
> You grab the underlying file descriptor using nbd_aio_get_fd, which is
> what you poll on.  You also have to call nbd_aio_get_direction which
> returns READ, WRITE or BOTH (== READ|WRITE).  You then set up some
> events mechanism (eg. poll, epoll, etc.), poll until the file
> descriptor is ready, and call one of nbd_aio_notify_read or
> nbd_aio_notify_write.
>
> The direction can change any time the handle state changes, which
> includes whenever you issue a command (eg. nbd_aio_pread), or whenever
> you call nbd_aio_notify_*.  You therefore have to call
> nbd_aio_get_direction frequently.
>
> A typical loop using poll might look like:
>
>   fd = nbd_aio_get_fd (nbd);
>   for (;;) {
>     /* <-- If you need to issue more commands, do that here. */
>     dir = nbd_aio_get_direction (nbd);
>     pollfd[0].fd = fd;
>     pollfd[0].events = 0;
>     if (dir & LIBNBD_AIO_DIRECTION_READ) pollfd[0].events |= POLLIN;
>     if (dir & LIBNBD_AIO_DIRECTION_WRITE) pollfd[0].events |= POLLOUT;
>     poll (pollfd, 1, -1);
>     if (pollfd[0].revents & LIBNBD_AIO_DIRECTION_READ)

Rather, if (pollfd[0].revents & POLLIN && dir & LIBNBD_AIO_DIRECTION_READ)

>       nbd_aio_notify_read ();
>     else if (pollfd[0].revents & LIBNBD_AIO_DIRECTION_WRITE)

and again

>       nbd_aio_notify_write ();
>   }

or in the nbdkit case:

  nbdkit thread
    nbd_aio_pwrite ();
    write (selfpipe[1], &c, 1);

  reader thread
    for (;;) {
      dir = nbd_aio_get_direction (nbd);
      pollfd[0].fd = fd;
      pollfd[0].events = 0;
      pollfd[1].fd = selfpipe[0];
      pollfd[1].events = POLLIN;
      if (dir & LIBNBD_AIO_DIRECTION_READ) pollfd[0].events |= POLLIN;
      if (dir & LIBNBD_AIO_DIRECTION_WRITE) pollfd[0].events |= POLLOUT;
      poll (pollfd, 2, -1);
      if (pollfd[0].revents & POLLIN && dir & LIBNBD_AIO_DIRECTION_READ)
        nbd_aio_notify_read ();
      else if (pollfd[0].revents & POLLOUT && dir & LIBNBD_AIO_DIRECTION_WRITE)
        nbd_aio_notify_write ();
      if (pollfd[1].revents & POLLIN)
        read (selfpipe[0], &c, 1);
    }

>
> ----------------------------------------------------------------------
>
> The above code is of course assuming a single thread.  But to
> simultaneously read and write we will need at least two threads.

If we have two threads, an obvious split would be to state that
requests must be done from one thread while replies are done from
another, at which point the state machine would look something like:

  nbdkit thread
    id = nbd_aio_pwrite ();
    while (!nbd_aio_command_inflight (id)) {
      pollfd[0].fd = fd;
      pollfd[0].events = POLLOUT;
      poll (pollfd, 1, -1);
      if (pollfd[0].revents & POLLOUT)
        nbd_aio_notify_write ();
    }

  reader thread
    while (!nbd_aio_is_rstate_ready ()) {
      pollfd[0].fd = fd;
      pollfd[0].events = POLLIN;
      poll (pollfd, 1, -1);
      if (pollfd[0].revents & POLLIN)
        nbd_aio_notify_read ();
    }

where we could also add a new convenience API:

  nbd_aio_pwrite_send()

that does the work mentioned for the nbdkit thread, and is documented
to block the thread up until the point in time where the state
machine has finished sending the request (even if it had to poll on
POLLOUT in the middle), but before waiting for the server's reply;
and while it can lock out other potential writers (that is, hold the
h->wstate lock for the entire duration), it must not lock out readers
(that is, it must drop h->lock as soon as we prove that we are okay
sending the request).

There probably needs to be checks for POLLHUP or POLLERR in there, to
ensure the loop quits if the server hangs up or if we encounter an
error that prevents further request/reply sequences.

>
> It's hard for me to see a nice way to evolve the API to support
> multiple threads, but I guess issues include:
>
> - If one thread is waiting in poll(2) and another thread issues a
>   command, how do we get the first thread to return from poll and
>   reevaluate direction?  (Eric suggests a pipe-to-self for this)

That is, if you liked my self-pipe solution, we could somehow fold
that self-pipe into libnbd internals instead of making each caller
re-implement it.  One idea I had for that was, at least on Linux, to
use an epoll fd, where the user's poll loop is as simple as:

  while (!nbd_aio_is_rstate_ready ()) {
    pollfd[0].fd = fd;
    pollfd[0].events = POLLIN;
    poll (pollfd, 1, -1);
    if (pollfd[0].revents & POLLIN)
      nbd_aio_notify_read ();
  }

because pollfd[0].fd is really an epoll fd (and libnbd, under the
hood, actually calls epoll_wait() during nbd_aio_notify_read() to
learn what REALLY happened, whether it was the wstate machine adding
a request to the queue and tickling the self-pipe, or the rstate
machine making progress).  But in that model, all write()s are now
done solely from the single reader loop, during nbd_aio_notify_read();
the other threads calling nbd_aio_pwrite() never call write directly,
and we're back to the question of how we can read and write to the
same socket simultaneously - so in addition to having two state
machines, we might also have to make libnbd itself run two separate
threads per handle.

It's a shame that Linux does not yet have true aio for network packets:

- glibc's implementation of POSIX aio_* creates a helper thread under
  the hood, the same as I suggested we might do above

- the Linux kernel's io_submit() still performs I/O in order rather
  than in parallel
  https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/

- the Linux kernel's sendto(MSG_ZEROCOPY) can perform true async
  network traffic (you can regain control of your single thread prior
  to the full write going out the door, and thus can perform a read
  at the same time from a single thread), although it is only
  efficient once you reach the point of sending 10k or more at once

>
> - If two threads are waiting in poll(2) and both are notified that
>   the fd is ready, how do we get one of them to read and the other to
>   write?  I think this implies that we don't have two threads doing
>   poll(2), but if not how do we farm out the read/write work to two
>   or more threads?

Either we document that when using two threads, the user runs one
poll loop strictly on POLLIN and the other loop strictly on POLLOUT
(and doesn't have to bother with direction changes) - but only after
the nbd handle has reached is_ready() for the first time (the
handshake phase is very much lock-step, where a single thread
checking direction all the time really is the best solution).  Or we
tell the user to run a single loop, manage a second thread under the
hood, and the user only has to poll on POLLIN (where we do the
coordination with our helper thread under the hood).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
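[A sketch of the "the fd you poll on is really an epoll fd" idea
above, assuming a hypothetical library-internal helper.  The only
facts relied on here are that on Linux an epoll fd is itself pollable
and reports POLLIN when any registered fd has pending events:

  #include <sys/epoll.h>

  /* Hypothetical libnbd-internal setup: register both the NBD socket and
   * the read end of the self-pipe in one epoll set, and hand the epoll fd
   * back to the caller as "the" fd to poll.  The caller then only ever
   * polls it for POLLIN; epoll_wait() inside nbd_aio_notify_read() would
   * disambiguate which fd actually fired. */
  static int
  make_wait_fd (int sock, int selfpipe_rd)
  {
    int efd = epoll_create1 (EPOLL_CLOEXEC);
    struct epoll_event ev = { 0 };

    if (efd == -1)
      return -1;

    ev.events = EPOLLIN;             /* the server has sent us something */
    ev.data.fd = sock;
    epoll_ctl (efd, EPOLL_CTL_ADD, sock, &ev);

    ev.events = EPOLLIN;             /* a writer thread queued a request */
    ev.data.fd = selfpipe_rd;
    epoll_ctl (efd, EPOLL_CTL_ADD, selfpipe_rd, &ev);

    return efd;
  }

When the direction gains WRITE, the library would flip the socket's
registration to EPOLLIN|EPOLLOUT with EPOLL_CTL_MOD; that detail is
omitted above.]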
Richard W.M. Jones
2019-Jun-01 06:28 UTC
[Libguestfs] [libnbd] Simultaneous read and write
I think it's possible to do this without making any large changes to
the implementation of libnbd.

The idea is to have a new struct socket subtype:

  https://github.com/libguestfs/libnbd/blob/e63a11736930c381a79a8cc2d03844cfff5db3ef/lib/internal.h#L190

which will work like the current raw struct socket for send():

  https://github.com/libguestfs/libnbd/blob/e63a11736930c381a79a8cc2d03844cfff5db3ef/lib/socket.c#L32

but for recv the data will come from a buffer populated by the caller
using their own thread.  The caller will be required to set up a
second thread which does basically:

  pthread_create (start_second_thread);

  start_second_thread ()
  {
    int fd = nbd_aio_get_fd (nbd);
    char buf[BUFSIZ];

    for (;;) {
      poll (fd);
      len = recv (fd, buf, sizeof buf, 0);
      nbd_aio_threaded_read (fd, buf, len);
    }
  }

Note this doesn't require any changes to the state machine, and
probably not any changes to the existing main loop.  The state
machine still runs through the same states as before, it's just that
when it's reading from the socket, instead of those reads coming from
calls to raw recv(), they will be satisfied immediately from the data
supplied by the second thread and buffered up in the handle.

It should also be compatible with TLS, which is currently implemented
by adding a second struct socket layer over the raw struct socket,
but in this case would work by adding the same layer over the
threaded read struct socket.

The actual mechanics of this are a bit more complex than described:

- We will need the caller to tell libnbd that we want to use the
  handle in this way, requiring a function such as
  nbd_set_threaded_read() to be called before connection.  The
  purpose of this is to ensure our connect calls set up a threaded
  read struct socket instead of the normal raw struct socket.

- We may still need a pipe-to-self.  I'm not sure if this is really
  necessary, but if so nbd_set_threaded_read can return a pipe fd
  which the caller will have to poll for POLLIN in the main loop.

- I'm not sure what the end condition is for the loop in
  start_second_thread.  Probably nbd_aio_threaded_read returns an
  indication to exit the thread.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
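[For concreteness, a slightly fleshed-out version of the
reader-thread sketch above.  nbd_aio_threaded_read() is the API being
proposed in this mail, not something that exists in libnbd, so its
prototype (and the idea that it returns -1 to tell the thread to
exit) is assumed purely for illustration; error handling is minimal:

  #include <libnbd.h>
  #include <poll.h>
  #include <stdio.h>
  #include <sys/socket.h>

  /* Proposed API, prototype assumed for illustration only. */
  extern int nbd_aio_threaded_read (struct nbd_handle *h,
                                    const void *buf, size_t len);

  static void *
  start_second_thread (void *arg)
  {
    struct nbd_handle *nbd = arg;
    int fd = nbd_aio_get_fd (nbd);
    char buf[BUFSIZ];

    for (;;) {
      struct pollfd pfd = { .fd = fd, .events = POLLIN };

      if (poll (&pfd, 1, -1) == -1)
        break;
      if (pfd.revents & (POLLIN | POLLHUP)) {
        ssize_t len = recv (fd, buf, sizeof buf, 0);
        if (len <= 0)
          break;                     /* EOF or socket error */
        if (nbd_aio_threaded_read (nbd, buf, len) == -1)
          break;                     /* assumed signal that the handle is closing */
      }
    }
    return NULL;
  }
]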
Richard W.M. Jones
2019-Jun-01 13:09 UTC
Re: [Libguestfs] [libnbd] Simultaneous read and write
On Fri, May 31, 2019 at 04:47:52PM -0500, Eric Blake wrote:
> On 5/31/19 11:01 AM, Richard W.M. Jones wrote:
> > Problem (a) : Only one thread of control can hold the libnbd handle
> > lock (h->lock), and since there can only be one thread of control
> > running in each handle at any time, that thread can only be reading or
> > writing.
>
> Indeed, when the writer thread calls nbd_aio_pread(), it is contending
> on the same lock as the nbd_aio_notify_read() in the reader thread, so
> we'd have to split it up into several finer-grained locks (maybe keep
> all locking APIs with a grab on h->lock at the beginning, but while
> holding that lock we then grab the h->rstate or h->wstate lock for the
> rest of the call, and drop the main h->lock at that point even though
> the API doesn't return until the state machine blocks again).

I disagree there's any need to split h->lock, but to try and put what
you're saying more clearly (for me :-): nbd_aio_pread and
nbd_aio_notify_read return quickly without blocking.

[I wonder if we can instrument libnbd in some way to make sure this
is always true?  Perhaps we could have an instrumented libnbd which
times all AIO calls and ensures they never block, or is there another
way to determine if a process blocked between two places in the code?
Google turns up nothing.]

Even if nbd_aio_pread ends up send(2)-ing to the socket or
nbd_aio_notify_read recv(2)'s, neither call should block.  At most
there is a copy of a small amount of data between the kernel and
userspace.  That isn't where the contention is happening.

If we go back to the diagram:

     write          write       write
  |===========|______________|=======|_______|===|   --> rq to server
  _____________|============|_________|=====|_____   <-- rp from server
                read           read

During those times where I put that "write" xor "read" is happening,
we're actually waiting on poll(2).  The problem is the state machine
is in (eg) ISSUE_COMMAND.SEND_WRITE_PAYLOAD, poll is monitoring the
fd for POLLIN|POLLOUT because:

  https://github.com/libguestfs/libnbd/blob/e63a11736930c381a79a8cc2d03844cfff5db3ef/generator/generator#L667_L668

and if poll returns revents == POLLIN then we move to
ISSUE_COMMAND.PAUSE_WRITE_PAYLOAD and then straight to REPLY.START
and then we read the whole reply (even the payload) without ever
checking for POLLOUT, see:

  https://github.com/libguestfs/libnbd/blob/e63a11736930c381a79a8cc2d03844cfff5db3ef/generator/generator#L687-L800

(This is not a criticism of your earlier patches which nicely solved
what was a much worse problem with deadlocking.)
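[One cheap, caller-side way to approximate the instrumentation idea
above is to time each aio call and complain if it takes suspiciously
long.  A sketch; the TIMED name and the 1 ms threshold are arbitrary:

  #include <stdio.h>
  #include <time.h>

  /* Wrap any nbd_aio_* call and warn if it took long enough to suggest
   * it blocked.  Purely a debugging aid. */
  #define TIMED(call)                                                     \
    do {                                                                  \
      struct timespec t0_, t1_;                                           \
      clock_gettime (CLOCK_MONOTONIC, &t0_);                              \
      (call);                                                             \
      clock_gettime (CLOCK_MONOTONIC, &t1_);                              \
      double ms_ = (t1_.tv_sec - t0_.tv_sec) * 1e3 +                      \
                   (t1_.tv_nsec - t0_.tv_nsec) / 1e6;                     \
      if (ms_ > 1.0)                                                      \
        fprintf (stderr, "%s took %.3f ms - did it block?\n", #call, ms_);\
    } while (0)

  /* Usage, e.g.:  TIMED (nbd_aio_notify_read (nbd)); */
]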
My other email attempts to solve this by having a second thread which
has already called recv() and buffered up the data, so we don't spend
any time in the receive states, although it's possible that we only
spend less time in the receive states.

> With my initial use of libnbd, the division of labor for which thread writes a packet falls into three classes:
>
> - if the state machine is ready, the libnbd fd is polling only on POLLIN; from there:
>
>   - if the request is small, the entire request is written by the nbdkit thread during nbd_aio_pwrite (in this case, our write to the pipe-to-self is actually wasted work, because the reader loop thread will spuriously wake up, check the directions, and see that it still only needs POLLIN for the next server response)
>
>   - if the request is large, the first half of the request is written from the nbdkit thread during nbd_aio_pwrite, and the second half of the request is written from the reader thread during nbd_aio_notify_write (here, our write to the pipe-to-self is essential, because we HAD to break the reader loop out of its POLLIN poll() in order to learn that it now needs to poll for POLLIN|POLLOUT)
>
> - if the state machine is not ready, the libnbd fd is either in the middle of processing a reply (POLLIN) or a request (POLLIN|POLLOUT, since we allow replies to interrupt the processing of a request).  nbd_aio_pwrite queues the new request and returns immediately, without write()ing anything from the nbdkit thread, and the entire request gets written by the reader thread during subsequent nbd_aio_notify_{read,write} (in this case, our write to the pipe-to-self is useful if libnbd was in the middle of processing a reply, but wasted work if it was in the middle of processing a previous request)

I think this reiterates what I said above, if I'm understanding it
correctly.

Actually, having written all that, I wish there was a way we could
visualize the state machine, what we're polling for, and how long we
wait at each point.  I might try instrumenting the code with debug
statements to get a clearer picture.

> > Problem (b) : There is only one state machine per handle (h->state), whereas to handle the write and read sides separately requires two state machines.  In the IRC discussion we gave these the preliminary names h->wstate and h->rstate.
> >
> > ----------------------------------------------------------------------
> >
> > It's worth also saying how the current API works, although we might want to change it.
> >
> > You grab the underlying file descriptor using nbd_aio_get_fd, which is what you poll on.  You also have to call nbd_aio_get_direction which returns READ, WRITE or BOTH (== READ|WRITE).  You then set up some events mechanism (eg. poll, epoll, etc.), poll until the file descriptor is ready, and call one of nbd_aio_notify_read or nbd_aio_notify_write.
> >
> > The direction can change any time the handle state changes, which includes whenever you issue a command (eg. nbd_aio_pread), or whenever you call nbd_aio_notify_*.  You therefore have to call nbd_aio_get_direction frequently.
> >
> > A typical loop using poll might look like:
> >
> >   fd = nbd_aio_get_fd (nbd);
> >   for (;;) {
> >     /* <-- If you need to issue more commands, do that here. */
> >     dir = nbd_aio_get_direction (nbd);
> >     pollfd[0].fd = fd;
> >     pollfd[0].events = 0;
> >     if (dir & LIBNBD_AIO_DIRECTION_READ) pollfd[0].events |= POLLIN;
> >     if (dir & LIBNBD_AIO_DIRECTION_WRITE) pollfd[0].events |= POLLOUT;
> >     poll (pollfd, 1, -1);
> >     if (pollfd[0].revents & LIBNBD_AIO_DIRECTION_READ)
>
> Rather, if (pollfd[0].revents & POLLIN && dir & LIBNBD_AIO_DIRECTION_READ)
>
> >       nbd_aio_notify_read ();
> >     else if (pollfd[0].revents & LIBNBD_AIO_DIRECTION_WRITE)
>
> and again
>
> >       nbd_aio_notify_write ();
> >   }
>
> or in the nbdkit case:
>
>   nbdkit thread
>     nbd_aio_pwrite ();
>     write (selfpipe[1], &c, 1);
>
>   reader thread
>     for (;;) {
>       dir = nbd_aio_get_direction (nbd);
>       pollfd[0].fd = fd;
>       pollfd[0].events = 0;
>       pollfd[1].fd = selfpipe[0];
>       pollfd[1].events = POLLIN;
>       if (dir & LIBNBD_AIO_DIRECTION_READ) pollfd[0].events |= POLLIN;
>       if (dir & LIBNBD_AIO_DIRECTION_WRITE) pollfd[0].events |= POLLOUT;
>       poll (pollfd, 2, -1);
>       if (pollfd[0].revents & POLLIN && dir & LIBNBD_AIO_DIRECTION_READ)
>         nbd_aio_notify_read ();
>       else if (pollfd[0].revents & POLLOUT && dir & LIBNBD_AIO_DIRECTION_WRITE)
>         nbd_aio_notify_write ();
>       if (pollfd[1].revents & POLLIN)
>         read (selfpipe[0], &c, 1);
>     }
>
> >
> > ----------------------------------------------------------------------
> >
> > The above code is of course assuming a single thread.  But to simultaneously read and write we will need at least two threads.
>
> If we have two threads, an obvious split would be to state that requests must be done from one thread while replies are done from another, at which point the state machine would look something like:
>
>   nbdkit thread
>     id = nbd_aio_pwrite ();
>     while (!nbd_aio_command_inflight (id)) {
>       pollfd[0].fd = fd;
>       pollfd[0].events = POLLOUT;
>       poll (pollfd, 1, -1);
>       if (pollfd[0].revents & POLLOUT)
>         nbd_aio_notify_write ();
>     }
>
>   reader thread
>     while (!nbd_aio_is_rstate_ready ()) {
>       pollfd[0].fd = fd;
>       pollfd[0].events = POLLIN;
>       poll (pollfd, 1, -1);
>       if (pollfd[0].revents & POLLIN)
>         nbd_aio_notify_read ();
>     }
>
> where we could also add a new convenience API:
>
>   nbd_aio_pwrite_send()
>
> that does the work mentioned for the nbdkit thread, and is documented to block the thread up until the point in time where the state machine has finished sending the request (even if it had to poll on POLLOUT in the middle), but before waiting for the server's reply; and while it can lock out other potential writers (that is, hold the h->wstate lock for the entire duration), it must not lock out readers (that is, it must drop h->lock as soon as we prove that we are okay sending the request).
>
> There probably needs to be checks for POLLHUP or POLLERR in there, to ensure the loop quits if the server hangs up or if we encounter an error that prevents further request/reply sequences.
>
> > It's hard for me to see a nice way to evolve the API to support multiple threads, but I guess issues include:
> >
> > - If one thread is waiting in poll(2) and another thread issues a command, how do we get the first thread to return from poll and reevaluate direction?  (Eric suggests a pipe-to-self for this)
>
> That is, if you liked my self-pipe solution, we could somehow fold that self-pipe into libnbd internals instead of making each caller re-implement it.  One idea I had for that was, at least on Linux, to use an epoll fd, where the user's poll loop is as simple as:
>
>   while (!nbd_aio_is_rstate_ready ()) {
>     pollfd[0].fd = fd;
>     pollfd[0].events = POLLIN;
>     poll (pollfd, 1, -1);
>     if (pollfd[0].revents & POLLIN)
>       nbd_aio_notify_read ();
>   }
>
> because pollfd[0].fd is really an epoll fd (and libnbd, under the hood, actually calls epoll_wait() during nbd_aio_notify_read() to learn what REALLY happened, whether it was the wstate machine adding a request to the queue and tickling the self-pipe, or the rstate machine making progress).  But in that model, all write()s are now done solely from the single reader loop, during nbd_aio_notify_read(); the other threads calling nbd_aio_pwrite() never call write directly, and we're back to the question of how we can read and write to the same socket simultaneously - so in addition to having two state machines, we might also have to make libnbd itself run two separate threads per handle.
>
> It's a shame that Linux does not yet have true aio for network packets:
>
> - glibc's implementation of POSIX aio_* creates a helper thread under the hood, the same as I suggested we might do above
>
> - the Linux kernel's io_submit() still performs I/O in order rather than in parallel
>   https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/
>
> - the Linux kernel's sendto(MSG_ZEROCOPY) can perform true async network traffic (you can regain control of your single thread prior to the full write going out the door, and thus can perform a read at the same time from a single thread), although it is only efficient once you reach the point of sending 10k or more at once
>
> > - If two threads are waiting in poll(2) and both are notified that the fd is ready, how do we get one of them to read and the other to write?  I think this implies that we don't have two threads doing poll(2), but if not how do we farm out the read/write work to two or more threads?
>
> Either we document that when using two threads, the user runs one poll loop strictly on POLLIN and the other loop strictly on POLLOUT (and doesn't have to bother with direction changes) - but only after the nbd handle has reached is_ready() for the first time (the handshake phase is very much lock-step, where a single thread checking direction all the time really is the best solution).  Or we tell the user to run a single loop, manage a second thread under the hood, and the user only has to poll on POLLIN (where we do the coordination with our helper thread under the hood).

I'm uneasy about creating threads behind the scenes in a library.  If
we need another thread, and it seems obvious we do, I think the
caller should always create it.  Similarly, relying on Linux-specific
APIs like epoll.

Let's get the libnbd API right.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
Richard W.M. Jones
2019-Jun-03 08:32 UTC
Re: [Libguestfs] [libnbd] Simultaneous read and write
OK I think I have a way to solve this.  However we need to post the
writes to another thread, not the reads.  The reason is that writes
can be spooled up almost indefinitely so never block, whereas we may
block on reads if the server doesn't respond fast enough.

The actual change is very simple - I'll write a patch up later.  The
caller calls a new API such as:

  int nbd_set_concurrent_writer (void *data,
                                 void (*writer) (void *data, void *buf,
                                                 size_t len));

‘struct socket’ is modified here so that if a concurrent writer
callback is set in the handle it calls the writer function instead of
send(2):

  https://github.com/libguestfs/libnbd/blob/9d05f54f4daa892a1e581e1d68bc2dd4dad35e50/lib/socket.c#L48

There also has to be an error notification function that the writer
can call if it encounters an error (which are essentially
unrecoverable - it just moves the state to DEAD).

This pushes the implementation complexity to the caller, but I think
that is necessary and justified:

- libraries shouldn't be creating threads

- the caller has a choice of how to implement this, eg. using one
  writer thread over all handles, or one thread per handle

- the caller can do clever stuff like pinning the writer thread to a
  physical core or vectoring interrupts, which would be almost
  impossible if the thread was hidden inside the library

By including an example it should be possible to get over some of the
complexity.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
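[To make the division of responsibility concrete, a sketch of one way
a caller might implement the proposed writer callback with a single
writer thread per handle.  Everything here - the queue, the context
struct, copying the buffer because the proposal doesn't say who owns
it, how errors are reported - is a caller-side assumption, not part
of the proposed libnbd API:

  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>

  struct wr_item { void *buf; size_t len; struct wr_item *next; };

  struct writer_ctx {
    int sock;
    pthread_mutex_t lock;
    pthread_cond_t cond;
    struct wr_item *head, *tail;
  };

  /* The callback passed as 'writer' to the proposed
   * nbd_set_concurrent_writer(): enqueue a copy of the data and let the
   * dedicated thread send it. */
  static void
  writer_cb (void *data, void *buf, size_t len)
  {
    struct writer_ctx *ctx = data;
    struct wr_item *it = malloc (sizeof *it);

    it->buf = malloc (len);
    memcpy (it->buf, buf, len);   /* copy: assume libnbd may reuse buf */
    it->len = len;
    it->next = NULL;
    pthread_mutex_lock (&ctx->lock);
    if (ctx->tail) ctx->tail->next = it; else ctx->head = it;
    ctx->tail = it;
    pthread_cond_signal (&ctx->cond);
    pthread_mutex_unlock (&ctx->lock);
  }

  /* Dedicated writer thread: drain the queue, writing each buffer in full. */
  static void *
  writer_thread (void *data)
  {
    struct writer_ctx *ctx = data;

    for (;;) {
      pthread_mutex_lock (&ctx->lock);
      while (ctx->head == NULL)
        pthread_cond_wait (&ctx->cond, &ctx->lock);
      struct wr_item *it = ctx->head;
      ctx->head = it->next;
      if (!ctx->head) ctx->tail = NULL;
      pthread_mutex_unlock (&ctx->lock);

      size_t done = 0;
      while (done < it->len) {
        ssize_t r = send (ctx->sock, (char *) it->buf + done,
                          it->len - done, 0);
        if (r == -1) {
          /* here the caller would invoke the proposed error-notification
           * function, which moves the handle to DEAD */
          break;
        }
        done += r;
      }
      free (it->buf);
      free (it);
    }
    return NULL;
  }
]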