Nikolaus Rath
2022-May-21 12:21 UTC
[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency
Hi, How does the blocksize filter take into account writes that end-up overlapping due to read-modify-write cycles? Specifically, suppose there are two non-overlapping writes handled by two different threads, that, due to blocksize requirements, overlap when expanded. I think there is a risk that one thread may partially undo the work of the other here. Looking at the code, it seems that writes of unaligned heads and tails are protected with a global lock., but writes of aligned data can occur concurrently. However, does this not miss the case where there is one unaligned write that overlaps with an aligned one? For example, with blocksize 10, we could have: Thread 1: receives write request for offset=0, size=10 Thread 2: receives write request for offset=4, size=16 Thread 1: acquires lock, reads bytes 0-4 Thread 2: does aligned write (no locking needed), writes bytes 0-10 Thread 1: writes bytes 0-10, overwriting data from Thread 2 Best, -Nikolaus -- GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F ?Time flies like an arrow, fruit flies like a Banana.? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://listman.redhat.com/archives/libguestfs/attachments/20220521/9dc31f20/attachment.htm>
Richard W.M. Jones
2022-May-21 14:33 UTC
[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency
On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:> Hi, > > How does the blocksize filter take into account writes that end-up > overlapping due to read-modify-write cycles? > > Specifically, suppose there are two non-overlapping writes handled > by two different threads, that, due to blocksize requirements, > overlap when expanded. I think there is a risk that one thread may > partially undo the work of the other here. > > Looking at the code, it seems that writes of unaligned heads and > tails are protected with a global lock., but writes of aligned data > can occur concurrently.I agree. Assuming the underlying plugin is NBDKIT_THREAD_MODEL_PARALLEL and no other filters impose thread model limits, the blocksize filter does not limit the thread model, so the thread model of nbdkit would also be NBDKIT_THREAD_MODEL_PARALLEL. That means that two writes either on different connections or pipelined on the same connection could happen at the same time. ?blocksize_pwrite? would be called concurrently for the two requests.> However, does this not miss the case where there is one unaligned > write that overlaps with an aligned one? > > For example, with blocksize 10, we could have: > > Thread 1: receives write request for offset=0, size=10 > Thread 2: receives write request for offset=4, size=16 > Thread 1: acquires lock, reads bytes 0-4 > Thread 2: does aligned write (no locking needed), writes bytes 0-10 > Thread 1: writes bytes 0-10, overwriting data from Thread 2I believe this analysis is correct. (CC'd to Eric who knows a lot more about this.) However I don't think it's a bug. If a client doesn't want writes to squash each other, then it shouldn't send overlapping requests. I bet the same thing happens with an SSD. NBD_CMD_FLAG_FUA is provided for clients that wish to ensure that a write has been committed before sending another request. Do you have an example of a client which sends overlapping requests and depends on particular behaviour of the server? You may be able to get it to work by using nbdkit-noparallel-filter which can be used to serialize nbdkit. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Eric Blake
2022-May-24 14:01 UTC
[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency
On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:> Hi, > > How does the blocksize filter take into account writes that end-up overlapping due to read-modify-write cycles? > > Specifically, suppose there are two non-overlapping writes handled by two different threads, that, due to blocksize requirements, overlap when expanded. I think there is a risk that one thread may partially undo the work of the other here. > > Looking at the code, it seems that writes of unaligned heads and tails are protected with a global lock., > but writes of aligned data can occur concurrently. > > However, does this not miss the case where there is one unaligned write that overlaps with an aligned one? > > For example, with blocksize 10, we could have:The blocksize filter requires blocks to be sized as a power of 2, which 10 is not. I will try to restate your question using hex boundaries; please correct me if I'm mis-stating things.> > Thread 1: receives write request for offset=0, size=10 > Thread 2: receives write request for offset=4, size=16minblock = 0x10 Thread 1: receives write request for offset 0x00, size 0x10 (aligned request) Thread 2: receives write request for offset 0x04, size 0x16 (unaligned offset, unaligned size) Graphically, we are wanting to write the following, given initial disk contents of I: 0 0 0 0 1 1 1 1 0...4...8...a...0...4...8...a... start IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII T1: AAAAAAAAAAAAAAAA T2: BBBBBBBBBBBBBBBBBBBBBB Because both writes are issued simultaneously, we do not know whether bytes 0x04 through 0x0f will be written as A or B. But our assumption is that because blocks are written atomically, we hope to get exactly one of the two following results, where either T1 beat T2: 0 0 0 0 1 1 1 1 0...4...8...a...0...4...8...a... end1: AAAABBBBBBBBBBBBBBBBBBBBBBIIIIII or where T2 beat T1: 0 0 0 0 1 1 1 1 0...4...8...a...0...4...8...a... end2: AAAAAAAAAAAAAAAABBBBBBBBBBIIIIII However, you are worried that a third possibility occurs:> Thread 1: acquires lock, reads bytes 0-4 > Thread 2: does aligned write (no locking needed), writes bytes 0-10 > Thread 1: writes bytes 0-10, overwriting data from Thread 2These do not match your initial conditions above (why is thread 1 reading; why is thread 2 doing aligned action). Again, I'm rewriting this into something that I think matches what you intended to ask, but I welcome corrections: T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f for the unaligned head (it only needs 0x00-0x03, but we have to read a block at a time), to populate its buffer with IIIIBBBBBBBBBBBB. T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock blocking it. T2 now writes 0x00-0x0f using the contents of its buffer, resulting in: 0 0 0 0 1 1 1 1 0...4...8...a...0...4...8...a... end3: IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII which does NOT reflect either of the possibilities where T1 and T2 write atomically. Basically, we have become the victim of sharding. You are correct that it is annoying that this third possibility (where T1 appears to have never run) is possible with the blocksize filter. And we should probably consider it as a data-corruption bug. Your blocksize example of 10 (or 0x10) bytes is unlikely, but we are more likely to hit scenarios where an older guest assumes it is writing to 512-byte aligned hardware, while using the blocksize filter to try and guarantee RMW atomic access to 4k modern hardware. The older client will be unaware that it must avoid parallel writes that are 512-aligned but land in the same 4k page, so it seems like the blocksize filter should be doing that. You have just demonstrated that our current approach (grabbing a single semaphore, only around the unaligned portions), does not do what we hoped. So what WOULD protect us, while still allowing as much parallelism as possible? It sounds like we want a RW-lock (note: not in the sense of parallel pread and exclusive pwrite operations, but rather in the sense of parallel aligned operations taking a rdlock whether it is a pread or pwrite operation, and exclusive unaligned operations taking a wrlock whether pread or pwrite). pthread_rwlock_t would work, although it does not have guarantees on whether the implementation will favor readers or writers. It is also possible to implement our own using 2 mutexes (favors readers), or a condvar and mutex (easy to favor writers) [1]. Favoring readers can starve writers indefinitely but gets the most concurrency; favoring writers avoids starvation but has less performance. Other policies are RCU (wait-free for rdlock) or seqlock, borrowing ideas from the Linux kernel. Maybe we make it a configurable knob which lock policy to use, based on how likely the client is to do unaligned operations that require a write lock? [1] https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org