Laszlo Ersek
2023-May-31 11:29 UTC
[Libguestfs] [libnbd PATCH v3 03/22] protocol: Add definitions for extended headers
On 5/30/23 20:48, Eric Blake wrote:
> On Tue, May 30, 2023 at 05:48:25PM +0200, Laszlo Ersek wrote:
>> On 5/30/23 16:56, Wouter Verhelst wrote:
>>> On Tue, May 30, 2023 at 01:50:59PM +0200, Laszlo Ersek wrote:
>>>> On 5/25/23 15:00, Eric Blake wrote:
>>
>>>>> +struct nbd_structured_reply_block_status_ext_hdr {
>>>>> +    uint32_t context_id; /* metadata context ID */
>>>>> +    uint32_t count;      /* length of following array */
>>>>
>>>> (1) I think that "length of following array" is confusing here. Per
>>>> spec, this is "descriptor count" -- that's clearer. "Length" could be
>>>> mistaken for "number of bytes".
>>>>
>>>> Also, what was the justification for *not* making "count" uint64_t?
>>>> (Just asking.) I do understand a server is permitted to respond with a
>>>> block status reply that covers less than the requested range, so I
>>>> understand a server, if it needs to, *can* truncate its response for
>>>> the sake of fitting "count" into a uint32_t.
>>>
>>> This is, as you say, the number of block descriptor messages we are
>>> going to send to the client. Each such message is 8 bytes long.
>>
>> 16 bytes, isn't it? (uint64_t length, uint64_t status_flags)
>
> Existing block status was 8 bytes per extent (uint32_t * 2), but yes,
> the extended extent info is 16 bytes (uint64_t * 2).
>
>>
>>> That would mean that by the time you overflow a uint32_t with the
>>> number of extents that are to be sent, you'll be promising to send
>>> 2^32 * 8 (i.e., 2^35) bytes of data, or 32 GiB.
>>
>> (Or 64 GiB if we agree that sizeof(nbd_block_descriptor_ext)=16.)
>>
>> But... yes, I'm aware this is exorbitant, practically speaking.
>
> We already coded into the NBD spec (commit 926a51df) that a server
> SHOULD truncate its response rather than sending more than 2M extent
> entries; I picked that number because 32M (the existing payload cap in
> qemu) / 16 bytes is 2M extents (for non-extended headers, it actually
> caps things at 16M instead of 32M).
> In that commit, I pointed out
> that qemu actually caps things at 128k entries (payload cap 1M without
> extended headers), and nbdkit caps things at 1M entries.[*]
>
>>
>> My concern was that "practical considerations" must have been what led
>> to the original 32-bit-only:
>>
>> struct nbd_block_descriptor {
>>     uint32_t length;       /* length of block */
>>     uint32_t status_flags; /* block type (hole etc) */
>> } NBD_ATTRIBUTE_PACKED;
>>
>> which is now proving too restrictive (being the basis for this entire
>> work, IIUC).
>>
>> A 2^(32+9) B = 2 TB image is not unthinkable. If the image used 512B
>> sectors, and sectors alternated between being a hole and being
>> allocated, then 2^32 extended descriptors would be justified.
>>
>> May not be practical, but that's "policy"; the "mechanism" could still
>> exist (if it doesn't come with undesirable costs).
>>
>>> That would be a ridiculous amount of data, and no user is going to
>>> wait for a client to finish parsing that amount of data without a hard
>>> reboot of their client.
>>
>> Over a 10gbit/s connection, transferring 64GB should take on the order
>> of a minute.
>>
>> ... *parsing* 4.3 billion entries is a different matter, of course ;)
>>
>> OK!
>>
>>> The only change that would be reasonable here is to reduce the size of
>>> this field to 16 bits, rather than increasing it (but then we lose
>>> alignment, so that's also not a good idea)
>>
>> Putting aside alignment even, I don't understand why reducing "count"
>> to uint16_t would be reasonable. With the current 32-bit-only block
>> descriptor, we already need to write loops in libnbd clients, because
>> we can't cover the entire remote image in one API call [*]. If I
>> understood Eric right earlier, the 64-bit extensions were supposed to
>> remedy that -- but as it stands, clients will still need loops
>> ("chunking") around block status fetching; is that right?
>
> While the larger extents reduce the need for looping, it does not
> entirely eliminate it.
> For example, just because the server can now
> tell you that an image is entirely data in just one reply does not
> mean that it will actually do so - qemu in particular limits block
> status of a qcow2 file to reporting just one cluster at a time for
> consistency reasons, where even if you use the maximum size of 2M
> clusters, you can never get more than (2M/16)*2M = 256G status
> reported in a single request.

I don't understand the calculation. I can imagine the following
interpretation:

- QEMU never sends more than 128K block descriptors, and each descriptor
  covers one 2MB sized cluster --> 256 GB of the disk covered in one go.

But I don't understand where the (2M/16) division comes from, even
though the quotient is 128K.

I can connect the constant "128K", and
<https://github.com/NetworkBlockDevice/nbd/commit/926a51df>, to your
paragraph [*] above, but not the division.

> But without 64-bit lengths, you are
> guaranteed to need at least 2 (and possibly more) round trips to map
> out an entire 6G image, while with 64-bit lengths, you can often get
> an entire image map in one round trip.
>
> Reducing 'count' to uint16_t for 64-bit responses would be possible if
> we wanted to hard-code a server limit of never sending more than 64k
> extents in one go; but above, I argued that existing servers currently
> do exceed that cap even for 32-bit responses.

OK, understood. Looping is still required, but we expect to see it less
exercised in practice, with extended headers.

>>
>> Let me emphasize again that I'm not challenging the spec; also I don't
>> secretly wish for the patches to do more than required by the spec. I
>> guess I'd like to understand the intent of the spec, plus the
>> consequences for client applications.
>>
>> [*] References (in this order!):
>>
>> - https://github.com/libguestfs/virt-v2v/commit/27c056cdc6aa0
>> - https://gitlab.com/nbdkit/libnbd/-/commit/0e714a6e06e6
>> - https://github.com/libguestfs/virt-v2v/commit/a2afed32d8b1f
>>
>> To be less cryptic, the first commit added "chunked" block status
>> fetching to virt-v2v. Because our OCaml language libnbd mappings
>> weren't proper at the time, that loop could move backwards and get
>> stuck. We fixed that in the second commit (regarding the bindings) and
>> then adapted virt-v2v's loop to the fixed bindings in the third
>> commit. I've been hoping that such complexities could be eliminated in
>> the future by virtue of not having to "chunk" the block status
>> fetching.
>>
>
> We'll always have to deal with servers that send shorter replies than
> we asked for, and where consecutive replies may have the same status.
> The best the spec was able to do was recommend that the server return
> as much as it can, and without consecutive status, where feasible.
>
>> (
>>
>> BTW I'm foreseeing a problem: if the extended block descriptor can
>> provide an unsigned 64-bit length, we're going to have trouble
>> exposing that in OCaml, because OCaml only has signed 64-bit integers.
>> So that's going to reproduce the same issue, only for OCaml callers of
>> the *new* API.
>
> At one point, I was playing with an idea to add a patch to the NBD
> spec to REQUIRE that an export be capped at 2^63-1 bytes (that is, the
> maximum of 'off_t'). It doesn't change any existing implementations,
> and actually frees us up to use signed 64-bit numbers where negative
> values are indeed error cases. I'll probably try to revisit my
> thoughts on that patch front, but don't think it holds up this series.
>
>>
>> I can see Eric's series includes patches like "ocaml: Add example for
>> 64-bit extents" -- I've not looked at those yet; for now I'm just
>> wondering what tricks we might need in the bindings generator.
>> The
>> method seen in the "middle patch" above won't work; we don't have a
>> native OCaml "i128" type for example that we could use as an escape
>> hatch, for representing C's uint64_t.
>
> I _did_ go with the (currently implicit) assumption that no server
> will ever expose larger than off_t can represent, and therefore a
> signed 64-bit size is good enough.

I'll have to see that patch later, but for now I think making that
assumption explicit would be nice (assert, abort, etc).

> Flags has to be unsigned, but flags
> is representing something different than image size.

I take it you mean "nbd_block_descriptor_ext.status_flags", which is a
"uint64_t".

If it's only used as a bitmap, that should be safe; first because some
spec (dependent on metacontext) will introduce each bit one by one, and
probably not start with bit#63; second because it makes no sense to sum
bitmaps.

However, Wouter mentions up-thread that (dependent on metacontext)
status_flags could actually stand for a length too:

'Some metadata namespaces treat the "flags" field as an unsigned
integer, to specify offsets or counters. In that context, the flags
field should indeed be extended to 64 bits.'

and then dealing with those particular metacontexts in OCaml could be a
problem.

It's probably best to catch unrepresentable values when converting
between OCaml's int64 and C's uint64_t (both ways) in the generator as
centrally as possible.

Laszlo
Eric Blake
2023-May-31 16:04 UTC
[Libguestfs] [libnbd PATCH v3 03/22] protocol: Add definitions for extended headers
On Wed, May 31, 2023 at 01:29:30PM +0200, Laszlo Ersek wrote:
> >> Putting aside alignment even, I don't understand why reducing
> >> "count" to uint16_t would be reasonable. With the current
> >> 32-bit-only block descriptor, we already need to write loops in
> >> libnbd clients, because we can't cover the entire remote image in
> >> one API call [*]. If I understood Eric right earlier, the 64-bit
> >> extensions were supposed to remedy that -- but as it stands, clients
> >> will still need loops ("chunking") around block status fetching; is
> >> that right?
> >
> > While the larger extents reduce the need for looping, it does not
> > entirely eliminate it. For example, just because the server can now
> > tell you that an image is entirely data in just one reply does not
> > mean that it will actually do so - qemu in particular limits block
> > status of a qcow2 file to reporting just one cluster at a time for
> > consistency reasons, where even if you use the maximum size of 2M
> > clusters, you can never get more than (2M/16)*2M = 256G status
> > reported in a single request.
>
> I don't understand the calculation. I can imagine the following
> interpretation:
>
> - QEMU never sends more than 128K block descriptors, and each
>   descriptor covers one 2MB sized cluster --> 256 GB of the disk
>   covered in one go.
>
> But I don't understand where the (2M/16) division comes from, even
> though the quotient is 128K.

Ah, I need to provide more backstory on the qcow2 format. A qcow2 image
has a fixed cluster size, chosen between 512 and 2M bytes. A smaller
cluster size has less wasted space for small images, but uses more
overhead. Each cluster has to be stored in an L1 map, where pages of the
map are also a cluster in length, with 16 bytes per map entry. So if you
pick a cluster size of 512, you get 512/16 or 32 entries per L1 page; if
you pick a cluster size of 2M, you get 2M/16 or 128k entries per L1
page.
When reporting block status, qemu reads at most one L1 page to then say
how each cluster referenced from that page is mapped.

https://gitlab.com/qemu-project/qemu/-/blob/master/docs/interop/qcow2.txt#L491

> I can connect the constant "128K", and
> <https://github.com/NetworkBlockDevice/nbd/commit/926a51df>, to your
> paragraph [*] above, but not the division.

In this case, the qemu limit on reporting block status of at most one
L1 map page at a time happens to have no relationship to the NBD
constant of limiting block status reports to no more than 1M extents
(8M bytes) in a single reply, nor the fact that qemu picked a cap of 1M
bytes (128k extents) on its NBD reply regardless of whether the
underlying image is qcow2 or some other format.

> > But without 64-bit lengths, you are
> > guaranteed to need at least 2 (and possibly more) round trips to map
> > out an entire 6G image, while with 64-bit lengths, you can often get
> > an entire image map in one round trip.
> >
> > Reducing 'count' to uint16_t for 64-bit responses would be possible
> > if we wanted to hard-code a server limit of never sending more than
> > 64k extents in one go; but above, I argued that existing servers
> > currently do exceed that cap even for 32-bit responses.
> >
>
> OK, understood. Looping is still required, but we expect to see it less
> exercised in practice, with extended headers.
>
> >>
> >> Let me emphasize again that I'm not challenging the spec; also I
> >> don't secretly wish for the patches to do more than required by the
> >> spec. I guess I'd like to understand the intent of the spec, plus
> >> the consequences for client applications.
> >>
> >> [*] References (in this order!):
> >>
> >> - https://github.com/libguestfs/virt-v2v/commit/27c056cdc6aa0
> >> - https://gitlab.com/nbdkit/libnbd/-/commit/0e714a6e06e6
> >> - https://github.com/libguestfs/virt-v2v/commit/a2afed32d8b1f
> >>
> >> To be less cryptic, the first commit added "chunked" block status
> >> fetching to virt-v2v.
> >> Because our OCaml language libnbd mappings
> >> weren't proper at the time, that loop could move backwards and get
> >> stuck. We fixed that in the second commit (regarding the bindings)
> >> and then adapted virt-v2v's loop to the fixed bindings in the third
> >> commit. I've been hoping that such complexities could be eliminated
> >> in the future by virtue of not having to "chunk" the block status
> >> fetching.
> >>
> >
> > We'll always have to deal with servers that send shorter replies than
> > we asked for, and where consecutive replies may have the same status.
> > The best the spec was able to do was recommend that the server return
> > as much as it can, and without consecutive status, where feasible.
> >
> >> (
> >>
> >> BTW I'm foreseeing a problem: if the extended block descriptor can
> >> provide an unsigned 64-bit length, we're going to have trouble
> >> exposing that in OCaml, because OCaml only has signed 64-bit
> >> integers. So that's going to reproduce the same issue, only for
> >> OCaml callers of the *new* API.
> >
> > At one point, I was playing with an idea to add a patch to the NBD
> > spec to REQUIRE that an export be capped at 2^63-1 bytes (that is,
> > the maximum of 'off_t'). It doesn't change any existing
> > implementations, and actually frees us up to use signed 64-bit
> > numbers where negative values are indeed error cases. I'll probably
> > try to revisit my thoughts on that patch front, but don't think it
> > holds up this series.
> >
> >>
> >> I can see Eric's series includes patches like "ocaml: Add example
> >> for 64-bit extents" -- I've not looked at those yet; for now I'm
> >> just wondering what tricks we might need in the bindings generator.
> >> The method seen in the "middle patch" above won't work; we don't
> >> have a native OCaml "i128" type for example that we could use as an
> >> escape hatch, for representing C's uint64_t.
> >
> > I _did_ go with the (currently implicit) assumption that no server
> > will ever expose larger than off_t can represent, and therefore a
> > signed 64-bit size is good enough.
>
> I'll have to see that patch later, but for now I think making that
> assumption explicit would be nice (assert, abort, etc).
>
> > Flags has to be unsigned, but flags
> > is representing something different than image size.
>
> I take it you mean "nbd_block_descriptor_ext.status_flags", which is a
> "uint64_t".

Yes

> If it's only used as a bitmap, that should be safe; first because some
> spec (dependent on metacontext) will introduce each bit one by one,
> and probably not start with bit#63; second because it makes no sense
> to sum bitmaps.
>
> However, Wouter mentions up-thread that (dependent on metacontext)
> status_flags could actually stand for a length too:
>
> 'Some metadata namespaces treat the "flags" field as an unsigned
> integer, to specify offsets or counters. In that context, the flags
> field should indeed be extended to 64 bits.'
>
> and then dealing with those particular metacontexts in OCaml could be
> a problem.
>
> It's probably best to catch unrepresentable values when converting
> between OCaml's int64 and C's uint64_t (both ways) in the generator as
> centrally as possible.

Indeed; that's where I'm hoping Rich may chime in with ideas for what
is best for the C <=> OCaml translation.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org