Richard W.M. Jones
2023-Jun-06 19:06 UTC
[Libguestfs] [PATCH nbdkit 2/2] curl: Fallback to GET if HEAD not supported
Michael Henriksen pointed out an issue with this approach. If the web server is actually generating the content on the fly then it may send it as chunked encoding, and in HTTP/1.1 it's not required that the Content-Length field is present (since it may not be known when the server begins sending the content): https://www.rfc-editor.org/rfc/rfc2616#section-4.4 The only way to know the true length would be to download all the content, which we definitely don't want to do. I'm not totally clear if Content-Length, if present, must be valid. Curl source code seems to imply not: https://github.com/curl/curl/blob/6e4fedeef7ff2aefaa4841a4710d6b8a37c6ae7c/lib/http.c#L3868 but maybe they are just being over-cautious? The RFC is a bit confusing. AWS itself _does_ send Content-Length and it appears to be valid ... So one approach might be to assume it is valid, which I believe is what the current patch series does. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Richard W.M. Jones
2023-Jun-07 12:59 UTC
[Libguestfs] [PATCH nbdkit 2/2] curl: Fallback to GET if HEAD not supported
On Tue, Jun 06, 2023 at 08:06:50PM +0100, Richard W.M. Jones wrote:> > Michael Henriksen pointed out an issue with this approach. > > If the web server is actually generating the content on the fly then > it may send it as chunked encoding, and in HTTP/1.1 it's not required > that the Content-Length field is present (since it may not be known > when the server begins sending the content): > > https://www.rfc-editor.org/rfc/rfc2616#section-4.4 > > The only way to know the true length would be to download all the > content, which we definitely don't want to do. > > I'm not totally clear if Content-Length, if present, must be valid. > Curl source code seems to imply not: > > https://github.com/curl/curl/blob/6e4fedeef7ff2aefaa4841a4710d6b8a37c6ae7c/lib/http.c#L3868 > > but maybe they are just being over-cautious? The RFC is a bit > confusing. > > AWS itself _does_ send Content-Length and it appears to be valid ... > So one approach might be to assume it is valid, which I believe is > what the current patch series does.After looking at the code this morning I'm fairly sure this is a non-problem, or to be more specific it is an existing (but not serious) problem in nbdkit-curl-plugin. Curl already deals with the case where a web server sends the Transfer-encoding: chunked header with a Content-Length header (which is apparently wrong) by setting the size to -1: https://github.com/curl/curl/blob/cd18e5c4645e3a3f8ca0feecefe5b1150405c8ff/lib/http.c#L3868 which in curl-speak means the size is unknown: https://curl.se/libcurl/c/CURLINFO_CONTENT_LENGTH_DOWNLOAD_T.html and we already deal with that. Since it is already possible for a web server to set headers like this in a HEAD request, we're already dealing with this case (by rejecting the URL in the plugin). Adding this GET fallback method, which uses the common code path for fetching the size, does not make this any worse or better. I added a comment to this effect to the code, and split up the patch a lot more, and have pushed it in commits 51b86cff3..4b111535c Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
Laszlo Ersek
2023-Jun-08 11:58 UTC
[Libguestfs] [PATCH nbdkit 2/2] curl: Fallback to GET if HEAD not supported
On 6/6/23 21:06, Richard W.M. Jones wrote:> > Michael Henriksen pointed out an issue with this approach. > > If the web server is actually generating the content on the fly then > it may send it as chunked encoding, and in HTTP/1.1 it's not required > that the Content-Length field is present (since it may not be known > when the server begins sending the content): > > https://www.rfc-editor.org/rfc/rfc2616#section-4.4 > > The only way to know the true length would be to download all the > content, which we definitely don't want to do. > > I'm not totally clear if Content-Length, if present, must be valid. > Curl source code seems to imply not: > > https://github.com/curl/curl/blob/6e4fedeef7ff2aefaa4841a4710d6b8a37c6ae7c/lib/http.c#L3868 > > but maybe they are just being over-cautious? The RFC is a bit > confusing. > > AWS itself _does_ send Content-Length and it appears to be valid ... > So one approach might be to assume it is valid, which I believe is > what the current patch series does.I've been aware that some servers don't send Content-Length; you can notice such servers even through simple downloads in Firefox: the download manager can't estimate for you how much time is left until the download completes. You don't get a nice progress bar. I think it's fine if nbdkit's curl plugin just doesn't support such web servers. We can simply require a (valid) Content-Length field. I think that requirement would be similar to another requirement we already place: I think our curl plugin depends on the server supporting byte ranges. Without that, we have no random access to the file on the web server, and it's not possible to expose it as a "block device". Right? Laszlo