Eric Blake
2022-May-24 14:01 UTC
[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency
On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:
> Hi,
>
> How does the blocksize filter take into account writes that end up
> overlapping due to read-modify-write cycles?
>
> Specifically, suppose there are two non-overlapping writes handled by
> two different threads that, due to blocksize requirements, overlap
> when expanded.  I think there is a risk that one thread may partially
> undo the work of the other here.
>
> Looking at the code, it seems that writes of unaligned heads and
> tails are protected with a global lock, but writes of aligned data
> can occur concurrently.
>
> However, does this not miss the case where there is one unaligned
> write that overlaps with an aligned one?
>
> For example, with blocksize 10, we could have:

The blocksize filter requires blocks to be sized as a power of 2, which
10 is not.  I will try to restate your question using hex boundaries;
please correct me if I'm mis-stating things.

> Thread 1: receives write request for offset=0, size=10
> Thread 2: receives write request for offset=4, size=16

minblock = 0x10
Thread 1: receives write request for offset 0x00, size 0x10 (aligned request)
Thread 2: receives write request for offset 0x04, size 0x16 (unaligned offset, unaligned size)

Graphically, we are wanting to write the following, given initial disk
contents of I:

       0 0 0 0 1 1 1 1
       0...4...8...a...0...4...8...a...
start  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
T1:    AAAAAAAAAAAAAAAA
T2:        BBBBBBBBBBBBBBBBBBBBBB

Because both writes are issued simultaneously, we do not know whether
bytes 0x04 through 0x0f will be written as A or B.  But our assumption
is that because blocks are written atomically, we hope to get exactly
one of the two following results, where either T1 beat T2:

      0 0 0 0 1 1 1 1
      0...4...8...a...0...4...8...a...
end1: AAAABBBBBBBBBBBBBBBBBBBBBBIIIIII

or where T2 beat T1:

      0 0 0 0 1 1 1 1
      0...4...8...a...0...4...8...a...
end2: AAAAAAAAAAAAAAAABBBBBBBBBBIIIIII

However, you are worried that a third possibility occurs:

> Thread 1: acquires lock, reads bytes 0-4
> Thread 2: does aligned write (no locking needed), writes bytes 0-10
> Thread 1: writes bytes 0-10, overwriting data from Thread 2

These do not match your initial conditions above (why is thread 1
reading; why is thread 2 doing the aligned action).  Again, I'm
rewriting this into something that I think matches what you intended
to ask, but I welcome corrections:

T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
for the unaligned head (it only needs 0x00-0x03, but we have to read a
block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.

T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
blocking it.

T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:

      0 0 0 0 1 1 1 1
      0...4...8...a...0...4...8...a...
end3: IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII

which does NOT reflect either of the possibilities where T1 and T2
write atomically.  Basically, we have become the victim of sharding.

You are correct that it is annoying that this third possibility (where
T1 appears to have never run) is possible with the blocksize filter,
and we should probably consider it a data-corruption bug.  Your
blocksize example of 10 (or 0x10) bytes is unlikely, but we are more
likely to hit scenarios where an older guest assumes it is writing to
512-byte aligned hardware, while using the blocksize filter to try and
guarantee atomic RMW access to modern 4k hardware.  The older client
will be unaware that it must avoid parallel writes that are
512-aligned but land in the same 4k page, so it seems like the
blocksize filter should be doing that on its behalf.

You have just demonstrated that our current approach (grabbing a
single semaphore, only around the unaligned portions) does not do what
we hoped.  So what WOULD protect us, while still allowing as much
parallelism as possible?
It sounds like we want a RW-lock (note: not in the sense of parallel
pread and exclusive pwrite operations, but rather in the sense of
parallel aligned operations taking a rdlock whether it is a pread or
pwrite operation, and exclusive unaligned operations taking a wrlock
whether pread or pwrite).

pthread_rwlock_t would work, although it does not have guarantees on
whether the implementation will favor readers or writers.  It is also
possible to implement our own using 2 mutexes (favors readers), or a
condvar and mutex (easy to favor writers) [1].  Favoring readers can
starve writers indefinitely but gets the most concurrency; favoring
writers avoids starvation but has less performance.  Other policies
are RCU (wait-free for rdlock) or seqlock, borrowing ideas from the
Linux kernel.  Maybe we make it a configurable knob which lock policy
to use, based on how likely the client is to do unaligned operations
that require a write lock?

[1] https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
Richard W.M. Jones
2022-May-24 14:24 UTC
[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency
On Tue, May 24, 2022 at 09:01:28AM -0500, Eric Blake wrote:
> On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:
> > Hi,
> >
> > How does the blocksize filter take into account writes that end up
> > overlapping due to read-modify-write cycles?
> >
> > Specifically, suppose there are two non-overlapping writes handled
> > by two different threads that, due to blocksize requirements,
> > overlap when expanded.  I think there is a risk that one thread may
> > partially undo the work of the other here.
> >
> > Looking at the code, it seems that writes of unaligned heads and
> > tails are protected with a global lock, but writes of aligned data
> > can occur concurrently.
> >
> > However, does this not miss the case where there is one unaligned
> > write that overlaps with an aligned one?
> >
> > For example, with blocksize 10, we could have:
>
> The blocksize filter requires blocks to be sized as a power of 2,
> which 10 is not.  I will try to restate your question using hex
> boundaries; please correct me if I'm mis-stating things.
>
> > Thread 1: receives write request for offset=0, size=10
> > Thread 2: receives write request for offset=4, size=16
>
> minblock = 0x10
> Thread 1: receives write request for offset 0x00, size 0x10 (aligned request)
> Thread 2: receives write request for offset 0x04, size 0x16 (unaligned offset, unaligned size)
>
> Graphically, we are wanting to write the following, given initial disk
> contents of I:
>
>        0 0 0 0 1 1 1 1
>        0...4...8...a...0...4...8...a...
> start  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
> T1:    AAAAAAAAAAAAAAAA
> T2:        BBBBBBBBBBBBBBBBBBBBBB
>
> Because both writes are issued simultaneously, we do not know whether
> bytes 0x04 through 0x0f will be written as A or B.  But our assumption
> is that because blocks are written atomically, we hope to get exactly
> one of the two following results, where either T1 beat T2:

The NBD protocol doesn't mention "atomic" anywhere.  Are writes
actually guaranteed to be atomic like this?
I'd be a bit more convinced if this could be shown to happen in a
non-overlapping case.  (It may be that we still want to fix the
blocksize filter as you've described -- using a rwlock seems like a
reasonable and non-invasive fix -- but I'd like to understand what is
guaranteed and what is a client's problem.)

>       0 0 0 0 1 1 1 1
>       0...4...8...a...0...4...8...a...
> end1: AAAABBBBBBBBBBBBBBBBBBBBBBIIIIII
>
> or where T2 beat T1:
>
>       0 0 0 0 1 1 1 1
>       0...4...8...a...0...4...8...a...
> end2: AAAAAAAAAAAAAAAABBBBBBBBBBIIIIII
>
> However, you are worried that a third possibility occurs:
>
> > Thread 1: acquires lock, reads bytes 0-4
> > Thread 2: does aligned write (no locking needed), writes bytes 0-10
> > Thread 1: writes bytes 0-10, overwriting data from Thread 2
>
> These do not match your initial conditions above (why is thread 1
> reading; why is thread 2 doing the aligned action).  Again, I'm
> rewriting this into something that I think matches what you intended
> to ask, but I welcome corrections:
>
> T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> for the unaligned head (it only needs 0x00-0x03, but we have to read
> a block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
>
> T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> blocking it.
>
> T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
>
>       0 0 0 0 1 1 1 1
>       0...4...8...a...0...4...8...a...
> end3: IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
>
> which does NOT reflect either of the possibilities where T1 and T2
> write atomically.  Basically, we have become the victim of sharding.
>
> You are correct that it is annoying that this third possibility
> (where T1 appears to have never run) is possible with the blocksize
> filter.  And we should probably consider it as a data-corruption bug.
> Your blocksize example of 10 (or 0x10) bytes is unlikely, but we are
> more likely to hit scenarios where an older guest assumes it is
> writing to 512-byte aligned hardware, while using the blocksize
> filter to try and guarantee RMW atomic access to 4k modern hardware.
> The older client will be unaware that it must avoid parallel writes
> that are 512-aligned but land in the same 4k page, so it seems like
> the blocksize filter should be doing that.
>
> You have just demonstrated that our current approach (grabbing a
> single semaphore, only around the unaligned portions), does not do
> what we hoped.  So what WOULD protect us, while still allowing as
> much parallelism as possible?  It sounds like we want a RW-lock
> (note: not in the sense of parallel pread and exclusive pwrite
> operations, but rather in the sense of parallel aligned operations
> taking a rdlock whether it is a pread or pwrite operation, and
> exclusive unaligned operations taking a wrlock whether pread or
> pwrite).
>
> pthread_rwlock_t would work, although it does not have guarantees on
> whether the implementation will favor readers or writers.  It is also
> possible to implement our own using 2 mutexes (favors readers), or a
> condvar and mutex (easy to favor writers) [1].  Favoring readers can
> starve writers indefinitely but gets the most concurrency; favoring
> writers avoids starvation but has less performance.  Other policies
> are RCU (wait-free for rdlock) or seqlock, borrowing ideas from the
> Linux kernel.  Maybe we make it a configurable knob which lock policy
> to use, based on how likely the client is to do unaligned operations
> that require a write lock?
>
> [1] https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Nikolaus Rath
2022-May-24 19:45 UTC
[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency
On May 24 2022, Eric Blake <eblake at redhat.com> wrote:
> minblock = 0x10
> Thread 1: receives write request for offset 0x00, size 0x10 (aligned request)
> Thread 2: receives write request for offset 0x04, size 0x16 (unaligned offset, unaligned size)
>
> Graphically, we are wanting to write the following, given initial disk
> contents of I:
>
>        0 0 0 0 1 1 1 1
>        0...4...8...a...0...4...8...a...
> start  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
> T1:    AAAAAAAAAAAAAAAA
> T2:        BBBBBBBBBBBBBBBBBBBBBB
>
> Because both writes are issued simultaneously, we do not know whether
> bytes 0x04 through 0x0f will be written as A or B.  But our assumption
> is that because blocks are written atomically, we hope to get exactly
> one of the two following results, where either T1 beat T2:
>
>       0 0 0 0 1 1 1 1
>       0...4...8...a...0...4...8...a...
> end1: AAAABBBBBBBBBBBBBBBBBBBBBBIIIIII
>
> or where T2 beat T1:
>
>       0 0 0 0 1 1 1 1
>       0...4...8...a...0...4...8...a...
> end2: AAAAAAAAAAAAAAAABBBBBBBBBBIIIIII
>
> However, you are worried that a third possibility occurs:
>
> T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> for the unaligned head (it only needs 0x00-0x03, but we have to read
> a block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
>
> T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> blocking it.
>
> T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
>
>       0 0 0 0 1 1 1 1
>       0...4...8...a...0...4...8...a...
> end3: IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
>
> which does NOT reflect either of the possibilities where T1 and T2
> write atomically.  Basically, we have become the victim of sharding.

Yes, this is the scenario that I am worried about.  I think this is a
data corruption problem no matter whether we assume that writes are
atomic or not.  In this scenario, the client has issued exactly one
request to write (among other things) bytes 0-4.  This request was
executed successfully, so bytes 0-4 should have the new contents.
There was no other write that affected this byte range, so whether the
write was done atomically or not does not matter.

> You are correct that it is annoying that this third possibility
> (where T1 appears to have never run) is possible with the blocksize
> filter.  And we should probably consider it as a data-corruption bug.
> Your blocksize example of 10 (or 0x10) bytes is unlikely, but we are
> more likely to hit scenarios where an older guest assumes it is
> writing to 512-byte aligned hardware, while using the blocksize
> filter to try and guarantee RMW atomic access to 4k modern hardware.
> The older client will be unaware that it must avoid parallel writes
> that are 512-aligned but land in the same 4k page, so it seems like
> the blocksize filter should be doing that.

Yes, I was just picking a very small number to illustrate the problem.
I have seen this happen in practice with much larger blocksizes (32 kB+).

> You have just demonstrated that our current approach (grabbing a
> single semaphore, only around the unaligned portions), does not do
> what we hoped.  So what WOULD protect us, while still allowing as
> much parallelism as possible?

How about a per-block lock, as implemented for the S3 plugin in
https://gitlab.com/nbdkit/nbdkit/-/merge_requests/10?  It might be a
bit harder to do in plain C because of the absence of set datatypes,
but I think it should work for the blocksize filter as well.

Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             "Time flies like an arrow, fruit flies like a Banana."