Richard W.M. Jones
2021-Oct-13 10:06 UTC
[Libguestfs] [PATCH nbdkit] curl: Add effective-url flag
Probably best to read this first: https://bugzilla.redhat.com/show_bug.cgi?id=2013000 This adds an effective-url=true|false flag to nbdkit-curl-plugin. If true, the first time we fetch the URL, we fetch the "effective" URL (ie. the URL after all redirects were resolved) and use that for all future connections within the current nbdkit instance. The idea behind this patch is to do with the Fedora mirror system which is flakey. Some mirrors return errors or 404s. And the mirror system will give you a new redirect each time (within your geographical region). This results in transfers sometimes failing because a single read went to a bad mirror. After implementing this patch I'm not very happy with it. It is incomplete because we pass the original URL to cookie-script and header-script, so plugin/curl/scripts.c will also need to be modified. Once those modifications are done the change becomes quite invasive. Also I'm not convinced that it really solves any problem. In the manual change I wrote: Note use of this feature in long-lived nbdkit instances can cause subtle problems: ? The effective URL persists across connections for the lifetime of the nbdkit instance. If nbdkit is used for a long time then it is possible for the redirected URL to become stale. ? It will defeat some mirror load-balancing techniques. ? If the mirror service sometimes redirects to a broken URL and it happens that the URL you fetch first is broken then nbdkit will no longer recover on subsequent connections (instead you will need to restart nbdkit). I suggested another way to solve this by using curl APIs to fetch the effective URL up front and passing that URL to nbdkit (see https://bugzilla.redhat.com/show_bug.cgi?id=2013000#c1), but apparently that solution isn't acceptable for unclear reasons. I can't think of any other way to solve this in the context of nbdkit (maybe have it detect when redirection is happening and retry the redirection on error?). So here's the patch. Rich.
Richard W.M. Jones
2021-Oct-13 10:06 UTC
[Libguestfs] [PATCH nbdkit] curl: Add effective-url flag
This commit adds a new command line option (effective-url=true) which causes nbdkit-curl-plugin the first time it fetches a URL to update its internal 'url' variable with the CURLINFO_EFFECTIVE_URL. That means, the URL after all redirects have been done. Further connections will be done using this post-redirect URL, ensuring that those connections are stable. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2013000 --- plugins/curl/nbdkit-curl-plugin.pod | 35 ++++++++++++++++++++++++ plugins/curl/curl.c | 42 ++++++++++++++++++++++++++--- 2 files changed, 73 insertions(+), 4 deletions(-) diff --git a/plugins/curl/nbdkit-curl-plugin.pod b/plugins/curl/nbdkit-curl-plugin.pod index c7acf6225..673f1d327 100644 --- a/plugins/curl/nbdkit-curl-plugin.pod +++ b/plugins/curl/nbdkit-curl-plugin.pod @@ -126,6 +126,40 @@ Run C<SCRIPT> (a command or shell script fragment) to generate the HTTP/HTTPS cookies. C<cookie-script> cannot be used with C<cookie>. See L</HEADER AND COOKIE SCRIPTS> below. +=item B<effective-url=true> + +(nbdkit E<ge> 1.30) + +Replace the URL supplied on the command line with the effective URL +(from L<CURLINFO_EFFECTIVE_URL(3)>, the final URL fetched after server +redirects). This can be used with mirror services that redirect to a +geographical region ? for example file download sites ? to ensure the +URL will always be the same. + +Note use of this feature in long-lived nbdkit instances can cause +subtle problems: + +=over 4 + +=item * + +The effective URL persists across connections for the lifetime of the +nbdkit instance. If nbdkit is used for a long time then it is +possible for the redirected URL to become stale. + +=item * + +It will defeat some mirror load-balancing techniques. + +=item * + +If the mirror service sometimes redirects to a broken URL and it +happens that the URL you fetch first is broken then nbdkit will no +longer recover on subsequent connections (instead you will need to +restart nbdkit). + +=back + =item B<followlocation=false> (nbdkit E<ge> 1.26) @@ -481,6 +515,7 @@ C<nbdkit-curl-plugin> first appeared in nbdkit 1.2. L<curl(1)>, L<libcurl(3)>, +L<CURLINFO_EFFECTIVE_URL(3)>, L<CURLOPT_CAINFO(3)>, L<CURLOPT_CAPATH(3)>, L<CURLOPT_COOKIE(3)>, diff --git a/plugins/curl/curl.c b/plugins/curl/curl.c index a1b0afba7..8c016509f 100644 --- a/plugins/curl/curl.c +++ b/plugins/curl/curl.c @@ -44,6 +44,8 @@ #include <errno.h> #include <assert.h> +#include <pthread.h> + #include <curl/curl.h> #include <nbdkit-plugin.h> @@ -73,6 +75,7 @@ const char *cookiefile = NULL; const char *cookiejar = NULL; const char *cookie_script = NULL; unsigned cookie_script_renew = 0; +bool effectiveurl = false; bool followlocation = true; struct curl_slist *headers = NULL; const char *header_script = NULL; @@ -93,6 +96,12 @@ const char *unix_socket_path = NULL; const char *user = NULL; const char *user_agent = NULL; +/* For handling the effective-url flag, we save the first effective + * URL we visit in this variable. + */ +char *url_effective = NULL; +pthread_mutex_t url_effective_lock = PTHREAD_MUTEX_INITIALIZER; + /* Use '-D curl.verbose=1' to set. */ NBDKIT_DLL_PUBLIC int curl_debug_verbose = 0; @@ -118,6 +127,7 @@ curl_unload (void) free (proxy_password); scripts_unload (); curl_global_cleanup (); + free (url_effective); } /* See <curl/curl.h> */ @@ -248,6 +258,13 @@ curl_config (const char *key, const char *value) return -1; } + else if (strcmp (key, "effective-url") == 0) { + r = nbdkit_parse_bool (value); + if (r == -1) + return -1; + effectiveurl = r; + } + else if (strcmp (key, "followlocation") == 0) { r = nbdkit_parse_bool (value); if (r == -1) @@ -404,6 +421,7 @@ curl_config_complete (void) "cookiejar=<FILENAME> Read and write cookies to jar.\n" \ "cookie-script=<SCRIPT> Script to set HTTP/HTTPS cookies.\n" \ "cookie-script-renew=<SECS> Time to renew HTTP/HTTPS cookies.\n" \ + "effective-url=true Always use redirected URL.\n" \ "followlocation=false Do not follow redirects.\n" \ "header=<HEADER> Set HTTP/HTTPS header.\n" \ "header-script=<SCRIPT> Script to set HTTP/HTTPS headers.\n" \ @@ -486,10 +504,14 @@ curl_open (int readonly) } /* Set the URL. */ - r = curl_easy_setopt (h->c, CURLOPT_URL, url); - if (r != CURLE_OK) { - display_curl_error (h, r, "curl_easy_setopt: CURLOPT_URL [%s]", url); - goto err; + { + ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&url_effective_lock); + + r = curl_easy_setopt (h->c, CURLOPT_URL, url_effective ? : url); + if (r != CURLE_OK) { + display_curl_error (h, r, "curl_easy_setopt: CURLOPT_URL [%s]", url); + goto err; + } } /* Various options we always set. @@ -594,6 +616,18 @@ curl_open (int readonly) goto err; } + if (effectiveurl) { + ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&url_effective_lock); + + if (url_effective == NULL) { + r = curl_easy_getinfo (h->c, CURLINFO_EFFECTIVE_URL, &url_effective); + if (r != CURLE_OK) { + display_curl_error (h, r, "could not get effective URL"); + goto err; + } + } + } + #ifdef HAVE_CURLINFO_CONTENT_LENGTH_DOWNLOAD_T r = curl_easy_getinfo (h->c, CURLINFO_CONTENT_LENGTH_DOWNLOAD_T, &o); if (r != CURLE_OK) { -- 2.32.0
Alexander Wels
2021-Oct-13 14:55 UTC
[Libguestfs] [PATCH nbdkit] curl: Add effective-url flag
On Wed, Oct 13, 2021 at 5:06 AM Richard W.M. Jones <rjones at redhat.com> wrote:> Probably best to read this first: > https://bugzilla.redhat.com/show_bug.cgi?id=2013000 > > This adds an effective-url=true|false flag to nbdkit-curl-plugin. If > true, the first time we fetch the URL, we fetch the "effective" URL > (ie. the URL after all redirects were resolved) and use that for all > future connections within the current nbdkit instance. > > The idea behind this patch is to do with the Fedora mirror system > which is flakey. Some mirrors return errors or 404s. And the mirror > system will give you a new redirect each time (within your > geographical region). This results in transfers sometimes failing > because a single read went to a bad mirror. > > After implementing this patch I'm not very happy with it. It is > incomplete because we pass the original URL to cookie-script and > header-script, so plugin/curl/scripts.c will also need to be modified. > Once those modifications are done the change becomes quite invasive. > Also I'm not convinced that it really solves any problem. In the > manual change I wrote: > > Note use of this feature in long-lived nbdkit instances can cause > subtle problems: > > ? The effective URL persists across connections for the lifetime > of the nbdkit instance. If nbdkit is used for a long time then > it is possible for the redirected URL to become stale. > > ? It will defeat some mirror load-balancing techniques. > > ? If the mirror service sometimes redirects to a broken URL and > it happens that the URL you fetch first is broken then nbdkit > will no longer recover on subsequent connections (instead you > will need to restart nbdkit). > > I suggested another way to solve this by using curl APIs to fetch the > effective URL up front and passing that URL to nbdkit (see > https://bugzilla.redhat.com/show_bug.cgi?id=2013000#c1), but > apparently that solution isn't acceptable for unclear reasons. > >The discussion is around who should be responsible for figuring out the effective URL should it be our side or should it be the nbdkit side. In an ideal world, no one would care because we never get a broken mirror, and we wouldn't be asking for any of this. However, we get bug reports about once a month about failing imports and when investigated it turns out that there is a bad mirror somewhere which causes the failed import. So one solution is to figure out the effective URL and just use that, and the question is who should figure out the effective URL. The application that starts nbdkit or nbdkit itself. IMO it boils down to who can best deal with the failures. The application that starts nbdkit knows if nbdkit is short lived or not. So the URL becoming stale is not an issue there since the application knows if it should restart nbdkit. Defeating the mirror load balancing will be true regardless of who does the pre-fetching. And if the first redirect is to the broken mirror, nbdkit will just fail and the calling application can recreate it. Personally I think the calling application knows how to handle certain scenarios well enough, that it should figure out the effective URL ahead of time. But I believe the objection from my colleagues is that it is using a hammer to solve the problem. nbdkit knows the internals of nbdkit, and it could come up with a more effective mechanism to deal with these mirror failures. Is it possible to do a retry on the byte range level? Basically if a particular range fails, retry it and possibly get a new mirror that does work, instead of failing the entire import at that point.> I can't think of any other way to solve this in the context of nbdkit > (maybe have it detect when redirection is happening and retry the > redirection on error?). So here's the patch. > > Rich. > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://listman.redhat.com/archives/libguestfs/attachments/20211013/5635c439/attachment.htm>
Laszlo Ersek
2021-Oct-14 15:18 UTC
[Libguestfs] [PATCH nbdkit] curl: Add effective-url flag
On 10/13/21 12:06, Richard W.M. Jones wrote:> Probably best to read this first: > https://bugzilla.redhat.com/show_bug.cgi?id=2013000 > > This adds an effective-url=true|false flag to nbdkit-curl-plugin. If > true, the first time we fetch the URL, we fetch the "effective" URL > (ie. the URL after all redirects were resolved) and use that for all > future connections within the current nbdkit instance. > > The idea behind this patch is to do with the Fedora mirror system > which is flakey. Some mirrors return errors or 404s. And the mirror > system will give you a new redirect each time (within your > geographical region). This results in transfers sometimes failing > because a single read went to a bad mirror. > > After implementing this patch I'm not very happy with it. It is > incomplete because we pass the original URL to cookie-script and > header-script, so plugin/curl/scripts.c will also need to be modified. > Once those modifications are done the change becomes quite invasive. > Also I'm not convinced that it really solves any problem. In the > manual change I wrote: > > Note use of this feature in long-lived nbdkit instances can cause > subtle problems: > > ? The effective URL persists across connections for the lifetime > of the nbdkit instance. If nbdkit is used for a long time then > it is possible for the redirected URL to become stale. > > ? It will defeat some mirror load-balancing techniques. > > ? If the mirror service sometimes redirects to a broken URL and > it happens that the URL you fetch first is broken then nbdkit > will no longer recover on subsequent connections (instead you > will need to restart nbdkit). > > I suggested another way to solve this by using curl APIs to fetch the > effective URL up front and passing that URL to nbdkit (see > https://bugzilla.redhat.com/show_bug.cgi?id=2013000#c1), but > apparently that solution isn't acceptable for unclear reasons. > > I can't think of any other way to solve this in the context of nbdkit > (maybe have it detect when redirection is happening and retry the > redirection on error?). So here's the patch.Given that I've been CC'd, here's my opinion: I strongly dislike transparent mirror selection (redirects) as a principle (based in experience). Precisely with the Fedora mirror system, I frequently see "dnf update" download some packages at lightning speed, then get stuck *completely* at some other package (potantially after spewing a bunch of "broken mirror" messages at me), so that I have to Ctrl-C the whole command, and re-issue it. Transparent redirects seem to want to hide the "mess" from the user, but they fail to do that quite frequently, IMO. The Cygwin experience is better, IMO. When you start the Cygwin package installer (whether you do it for initial installation or for updating packages), a fresh list of mirrors is fetched from some central location (I think?), but then you, the user, have to pick a *specific* mirror from that list. I always just go for fsn.hu, which I know to be a rock-stable mirror (for many OS distros, including Fedora) in my location. No bad surprises using that mirror, ever. With that in mind, I wouldn't complicate *any* application to deal with redirects / mirror selection transparently. Whatever the application does, the user will not be happy, and will want to tweak the logic. Just let the user pick an effective URL themselves, and stick with that forever. This may not be great for "load balancing", but AIUI the pain point here is failed *individual* imports. I think it should be OK to stick with a particular fixed URL for the duration of an import. (I should actually update my DNF repo files on Fedora to use fsn.hu as well, I just always get discouraged by the "metalink" stuff in there, and the hard-to-read variables (such as "$releasever", "$basearch", ...).)>From Alexander's description, it's clear that the reliability of themirror network is the core issue here, and we're now pushing around the unwanted job of hiding it from the user, from one application to the other. I'd say let the *user* deal with it, *once*, in their configuration, and neither application (= neither nbdkit nor the app that starts nbdkit) should struggle with redirects. In particular, tolerating (following) redirects per every single Range request looks incredibly inefficient to me (not to mention, brittle). (Sorry if my opinion is too naive.) Thanks Laszlo