Nikolaus Rath
2022-May-24 19:45 UTC
[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency
On May 24 2022, Eric Blake <eblake at redhat.com> wrote:

> minblock = 0x10
> Thread 1: receives write request for offset 0x00, size 0x10 (aligned request)
> Thread 2: receives write request for offset 0x04, size 0x16 (unaligned offset, unaligned size)
>
> Graphically, we are wanting to write the following, given initial disk
> contents of I:
>
>           0   0   0   0   1   1   1   1
>           0...4...8...a...0...4...8...a...
>    start  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
>    T1:    AAAAAAAAAAAAAAAA
>    T2:        BBBBBBBBBBBBBBBBBBBBBB
>
> Because both writes are issued simultaneously, we do not know whether
> bytes 0x04 through 0x0f will be written as A or B. But our assumption
> is that because blocks are written atomically, we hope to get exactly
> one of the two following results, where either T1 beat T2:
>
>           0   0   0   0   1   1   1   1
>           0...4...8...a...0...4...8...a...
>    end1:  AAAABBBBBBBBBBBBBBBBBBBBBBIIIIII
>
> or where T2 beat T1:
>
>           0   0   0   0   1   1   1   1
>           0...4...8...a...0...4...8...a...
>    end2:  AAAAAAAAAAAAAAAABBBBBBBBBBIIIIII
>
> However, you are worried that a third possibility occurs:
>
> T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> for the unaligned head (it only needs 0x00-0x03, but we have to read a
> block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
>
> T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> blocking it.
>
> T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
>
>           0   0   0   0   1   1   1   1
>           0...4...8...a...0...4...8...a...
>    end3:  IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
>
> which does NOT reflect either of the possibilities where T1 and T2
> write atomically. Basically, we have become the victim of sharding.

Yes, this is the scenario that I am worried about.

I think this is a data corruption problem no matter whether we assume
that writes should be atomic or not. In this scenario, the client has
issued exactly one request to write (among other things) bytes 0-4. This
request was executed successfully, so bytes 0-4 should have the new
contents.
There was no other write that affected this byte range, so whether the
write was done atomically or not does not matter.

> You are correct that it is annoying that this third possibility (where
> T1 appears to have never run) is possible with the blocksize filter.
> And we should probably consider it as a data-corruption bug. Your
> blocksize example of 10 (or 0x10) bytes is unlikely, but we are more
> likely to hit scenarios where an older guest assumes it is writing to
> 512-byte aligned hardware, while using the blocksize filter to try and
> guarantee RMW atomic access to 4k modern hardware. The older client
> will be unaware that it must avoid parallel writes that are
> 512-aligned but land in the same 4k page, so it seems like the
> blocksize filter should be doing that.

Yes, I was just picking a very small number to illustrate the problem. I
have seen this happen in practice with much larger blocksizes (32 kB+).

> You have just demonstrated that our current approach (grabbing a
> single semaphore, only around the unaligned portions), does not do
> what we hoped. So what WOULD protect us, while still allowing as much
> parallelism as possible?

How about a per-block lock as implemented for the S3 plugin in
https://gitlab.com/nbdkit/nbdkit/-/merge_requests/10?

It might be a bit harder to do in plain C because of the absence of set
datatypes, but I think it should work for the blocksize filter as well.

Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

"Time flies like an arrow, fruit flies like a Banana."
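[Editorial note: the per-block locking scheme suggested above can be sketched in Python as follows. This is an illustrative sketch of the idea only, not the actual code from the linked merge request; the names `BlockLock` and `write`, and the toy in-memory `disk`, are invented for the example.]

```python
import threading
import time

# Per-block locking: before touching a block, a request must hold that
# block's lock. A set records which block numbers are currently held,
# and a condition variable lets other threads wait for a release.
class BlockLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._locked = set()          # block numbers currently held

    def acquire(self, blockno):
        with self._cond:
            while blockno in self._locked:
                self._cond.wait()     # another request holds this block
            self._locked.add(blockno)

    def release(self, blockno):
        with self._cond:
            self._locked.discard(blockno)
            self._cond.notify_all()   # wake any waiters for this block

BLOCKSIZE = 0x10
disk = bytearray(b"I" * 0x20)
lock = BlockLock()

def write(offset, data):
    """Read-modify-write at block granularity, holding every block
    lock the request touches for the whole RMW cycle."""
    end = offset + len(data)
    first, last = offset // BLOCKSIZE, (end - 1) // BLOCKSIZE
    for b in range(first, last + 1):  # ascending order avoids deadlock
        lock.acquire(b)
    try:
        lo, hi = first * BLOCKSIZE, (last + 1) * BLOCKSIZE
        buf = bytearray(disk[lo:hi])  # read covering blocks
        time.sleep(0.01)              # widen the would-be race window
        buf[offset - lo:end - lo] = data
        disk[lo:hi] = buf             # write covering blocks back
    finally:
        for b in range(first, last + 1):
            lock.release(b)

# The scenario from the thread: T1 writes 0x00-0x0f, T2 writes 0x04-0x19.
t1 = threading.Thread(target=write, args=(0x00, b"A" * 0x10))
t2 = threading.Thread(target=write, args=(0x04, b"B" * 0x16))
t1.start(); t2.start(); t1.join(); t2.join()

# Both requests touch block 0, so their RMW cycles serialize. Whichever
# order they run in, bytes 0x00-0x03 end up "A"; the lost-update outcome
# end3 (IIIIBBBB...) cannot occur.
assert disk[:4] == b"AAAA"
```

Because locks are always taken in ascending block order, a request spanning several blocks cannot deadlock against another such request.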
Eric Blake
2022-May-25 18:28 UTC
[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency
On Tue, May 24, 2022 at 08:45:02PM +0100, Nikolaus Rath wrote:
> > However, you are worried that a third possibility occurs:
> >
> > T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> > for the unaligned head (it only needs 0x00-0x03, but we have to read a
> > block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
> >
> > T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> > blocking it.
> >
> > T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
> >
> >           0   0   0   0   1   1   1   1
> >           0...4...8...a...0...4...8...a...
> >    end3:  IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
> >
> > which does NOT reflect either of the possibilities where T1 and T2
> > write atomically. Basically, we have become the victim of sharding.
>
> Yes, this is the scenario that I am worried about.
>
> I think this is a data corruption problem no matter whether we assume
> that writes should be atomic or not.

Writes to a single block should be atomic. Writes larger than a block
need not be atomic. But when we advertise a particular block size,
clients should be safe in assuming that anything smaller than that block
size is inadvisable (either it will fail as too small, or it will incur
a RMW penalty), but that something at the block size should not need
client-side protection when no other parallel requests are overlapping
that block. And that is where our blocksize filter is currently failing.

> In this scenario, the client has issued exactly one request to write
> (among other things) bytes 0-4. This request was executed successfully,
> so bytes 0-4 should have the new contents. There was no other write that
> affected this byte range, so whether the write was done atomically or
> not does not matter.

Basically, because the blocksize filter advertised a block size of 1,
the client was not expecting to need to avoid non-overlapping writes of
anything larger than a 1-byte block.
So yes, the more I think about this, the more I see it as a bug in the
blocksize filter.

> > You are correct that it is annoying that this third possibility (where
> > T1 appears to have never run) is possible with the blocksize filter.
> > And we should probably consider it as a data-corruption bug. Your
> > blocksize example of 10 (or 0x10) bytes is unlikely, but we are more
> > likely to hit scenarios where an older guest assumes it is writing to
> > 512-byte aligned hardware, while using the blocksize filter to try and
> > guarantee RMW atomic access to 4k modern hardware. The older client
> > will be unaware that it must avoid parallel writes that are
> > 512-aligned but land in the same 4k page, so it seems like the
> > blocksize filter should be doing that.
>
> Yes, I was just picking a very small number to illustrate the problem. I
> have seen this happen in practice with much larger blocksizes (32 kB+).
>
> > You have just demonstrated that our current approach (grabbing a
> > single semaphore, only around the unaligned portions), does not do
> > what we hoped. So what WOULD protect us, while still allowing as much
> > parallelism as possible?
>
> How about a per-block lock as implemented for the S3 plugin in
> https://gitlab.com/nbdkit/nbdkit/-/merge_requests/10?

That feels pretty heavyweight. It may allow more parallelism, but at the
expense of more resources. I'm hoping that a single rwlock will do,
although it may be dominated by starvation time when toggling between
unaligned actions (serialized) vs aligned actions (parallel).

> It might be a bit harder to do in plain C because of the absence of set
> datatypes, but I think it should work for the blocksize filter as well.

Just because the language does not have a builtin set type doesn't mean
we can't code one; but you're right that it would be more code.
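[Editorial note: the single-rwlock idea can be sketched as follows. Illustrative Python only, not the actual nbdkit filter code; since Python's stdlib has no reader-writer lock, a minimal one is hand-rolled here. Aligned requests take the lock shared and run in parallel with each other; an unaligned request holds it exclusive across its whole read-modify-write cycle, so no aligned write can land between its read and its write-back.]

```python
import threading
import time

class RWLock:
    """Minimal reader-writer lock: many shared holders OR one exclusive
    holder. Shared = aligned I/O, exclusive = an unaligned RMW cycle."""
    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0          # active shared holders
        self._excl = False        # an exclusive holder is active

    def acquire_shared(self):
        with self._cond:
            while self._excl:
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            if self._shared == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._excl or self._shared:
                self._cond.wait()
            self._excl = True

    def release_exclusive(self):
        with self._cond:
            self._excl = False
            self._cond.notify_all()

BLOCKSIZE = 0x10
rwlock = RWLock()
disk = bytearray(b"I" * 0x20)

def pwrite(offset, data):
    end = offset + len(data)
    if offset % BLOCKSIZE == 0 and end % BLOCKSIZE == 0:
        rwlock.acquire_shared()       # aligned: run in parallel
        try:
            disk[offset:end] = data
        finally:
            rwlock.release_shared()
    else:
        rwlock.acquire_exclusive()    # unaligned: serialize whole RMW
        try:
            lo = offset - offset % BLOCKSIZE
            hi = end + (-end) % BLOCKSIZE
            buf = bytearray(disk[lo:hi])
            time.sleep(0.01)          # widen the would-be race window
            buf[offset - lo:end - lo] = data
            disk[lo:hi] = buf
        finally:
            rwlock.release_exclusive()

# T1 is aligned (shared), T2 is unaligned (exclusive): T1 either
# completes before T2's read or after T2's write-back, never between.
t1 = threading.Thread(target=pwrite, args=(0x00, b"A" * 0x10))
t2 = threading.Thread(target=pwrite, args=(0x04, b"B" * 0x16))
t1.start(); t2.start(); t1.join(); t2.join()
assert disk[:4] == b"AAAA"            # T1's write can no longer be lost
```

The trade-off mentioned above is visible in the sketch: every unaligned request stops all aligned traffic, whereas the per-block approach only stops requests touching the same blocks.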
I'm still hoping to post a first draft of a rwlock in the blocksize
filter later today (the hardest part is trying to come up with a
testsuite that will reliably demonstrate the race without the patch).

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org