Jeroen Ooms
2024-Apr-25 12:45 UTC
[Rd] Big speedup in install.packages() by re-using connections
I'd like to raise this again now that 4.4 is out. Below is a more
complete patch, which includes a function to properly clean up libcurl
when R quits. Implementing this is a little tricky because libcurl is a
separate "module" in R; perhaps there is a better way, but this works:

view: https://github.com/r-devel/r-svn/pull/166/files
patch: https://github.com/r-devel/r-svn/pull/166.diff

The old patch is still there as well; it is meant as a minimal proof of
concept to test the performance gains from reusing the connection:

view: https://github.com/r-devel/r-svn/pull/155/files
patch: https://github.com/r-devel/r-svn/pull/155.diff

Performance gains are greatest on high-bandwidth servers when
downloading many files from the same server (e.g. packages from a CRAN
mirror). In such cases, currently over 90% of the total time is spent
on initiating and tearing down a separate SSL connection for each file
download.

Thoughts?

On Sat, Mar 2, 2024 at 3:07 PM Jeroen Ooms <jeroenooms at gmail.com> wrote:
>
> Currently download.file() creates and terminates a new TLS connection
> for each download. This creates a lot of overhead which is expensive
> for both client and server (in particular the TLS handshake). Modern
> internet clients (including browsers) re-use connections for many
> http requests.
>
> We can do this in R by creating a persistent libcurl "multi-handle".
> The R libcurl implementation already uses a multi-handle, however it
> destroys it after each download, which defeats the purpose. The
> purpose of the multi-handle is to keep it alive and let libcurl
> maintain a persistent connection pool. This is particularly relevant
> for install.packages(), which needs to download many files from one
> and the same server.
>
> Here is a bare minimal proof-of-concept patch that re-uses one and
> the same multi-handle for all requests in R:
> https://github.com/r-devel/r-svn/pull/155/files
>
> Some quick benchmarking shows that this can lead to big speedups for
> download.packages() on high-bandwidth servers (such as CI). This
> quick test to download 100 packages from CRAN showed more than 10x
> speedup for me: https://github.com/r-devel/r-svn/pull/155
>
> Moreover, I think this may make install.packages() more robust. In CI
> build logs that download many packages, I often see one or two
> downloads randomly failing with a TLS-connect error. I am hopeful
> this problem will disappear when using a single connection to the
> CRAN server to download all the packages.
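For context, a minimal standalone sketch of the technique the patch
applies (this is not R's actual code; error checking is omitted and
fetch() is an illustrative name). libcurl keeps its connection cache
inside the multi-handle, so as long as the same CURLM survives between
transfers, easy handles added to it can reuse an established TLS
connection:

#include <curl/curl.h>

/* One multi-handle for the whole process: libcurl's connection
   cache lives inside it, so keeping it alive keeps connections open. */
static CURLM *shared_mhnd = NULL;

static void fetch(const char *url)
{
    if (!shared_mhnd)
        shared_mhnd = curl_multi_init(); /* created once, then reused */

    CURL *hnd = curl_easy_init();
    curl_easy_setopt(hnd, CURLOPT_URL, url);
    curl_multi_add_handle(shared_mhnd, hnd);

    int still_running = 0;
    do { /* drive the transfer to completion */
        curl_multi_perform(shared_mhnd, &still_running);
        curl_multi_wait(shared_mhnd, NULL, 0, 1000, NULL);
    } while (still_running);

    /* Removing the easy handle returns its connection to the
       multi-handle's pool instead of closing it. */
    curl_multi_remove_handle(shared_mhnd, hnd);
    curl_easy_cleanup(hnd);
}

Destroying the CURLM after every transfer, as the current
implementation does, empties that cache and forces a fresh TCP + TLS
handshake for each file.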
Ivan Krylov
2024-Apr-25 15:01 UTC
[Rd] Big speedup in install.packages() by re-using connections
On Thu, 25 Apr 2024 14:45:04 +0200
Jeroen Ooms <jeroenooms at gmail.com> wrote:

> Thoughts?

How verboten would it be to create an empty external pointer object,
add it to the preserved list, and set an on-exit finalizer to clean up
the curl multi-handle? As far as I can tell, the internet module is
not supposed to be unloaded, so this would not introduce an
opportunity to jump to an unmapped address. This makes it possible to
avoid adding a CurlCleanup() function to the internet module:

Index: src/modules/internet/libcurl.c
===================================================================
--- src/modules/internet/libcurl.c    (revision 86484)
+++ src/modules/internet/libcurl.c    (working copy)
@@ -55,6 +55,47 @@
 
 static int current_timeout = 0;
 
+// The multi-handle is shared between downloads for reusing connections
+static CURLM *shared_mhnd = NULL;
+static SEXP mhnd_sentinel = NULL;
+
+static void cleanup_mhnd(SEXP ignored)
+{
+    if (shared_mhnd) {
+        curl_multi_cleanup(shared_mhnd);
+        shared_mhnd = NULL;
+    }
+    curl_global_cleanup();
+}
+static void rollback_mhnd_sentinel(void* sentinel) {
+    // Failed to allocate memory while registering a finalizer,
+    // therefore must release the object
+    R_ReleaseObject((SEXP)sentinel);
+}
+static CURLM *get_mhnd(void)
+{
+    if (!mhnd_sentinel) {
+        SEXP sentinel = PROTECT(R_MakeExternalPtr(NULL, R_NilValue, R_NilValue));
+        R_PreserveObject(sentinel);
+        UNPROTECT(1);
+        // Avoid leaking the sentinel before setting the finalizer
+        RCNTXT cntxt;
+        begincontext(&cntxt, CTXT_CCODE, R_NilValue, R_BaseEnv, R_BaseEnv,
+                     R_NilValue, R_NilValue);
+        cntxt.cend = &rollback_mhnd_sentinel;
+        cntxt.cenddata = sentinel;
+        R_RegisterCFinalizerEx(sentinel, cleanup_mhnd, TRUE);
+        // Succeeded, no need to clean up if endcontext() fails allocation
+        mhnd_sentinel = sentinel;
+        cntxt.cend = NULL;
+        endcontext(&cntxt);
+    }
+    if (!shared_mhnd) {
+        shared_mhnd = curl_multi_init();
+    }
+    return shared_mhnd;
+}
+
 # if LIBCURL_VERSION_MAJOR < 7 || (LIBCURL_VERSION_MAJOR == 7 && LIBCURL_VERSION_MINOR < 28)
 
 // curl/curl.h includes <sys/select.h> and headers it requires.
@@ -565,8 +606,6 @@
 	if (c->hnd && c->hnd[i])
 	    curl_easy_cleanup(c->hnd[i]);
     }
-    if (c->mhnd)
-	curl_multi_cleanup(c->mhnd);
     if (c->headers)
 	curl_slist_free_all(c->headers);
 
@@ -668,7 +707,7 @@
 	c.headers = headers = tmp;
     }
 
-    CURLM *mhnd = curl_multi_init();
+    CURLM *mhnd = get_mhnd();
     if (!mhnd)
 	error(_("could not create curl handle"));
     c.mhnd = mhnd;

-- 
Best regards,
Ivan
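Stripped of the patch context, the sentinel idiom Ivan describes
reduces to the following sketch (cleanup_resource and make_sentinel
are illustrative names, and the RCNTXT rollback that guards against
allocation failure during registration is omitted here):

#include <Rinternals.h>

static void cleanup_resource(SEXP ignored)
{
    /* release the process-wide resource, e.g. curl_multi_cleanup() */
}

static SEXP make_sentinel(void)
{
    /* An external pointer with no address attached: it exists only
       so that a finalizer can be hung off it. */
    SEXP s = PROTECT(R_MakeExternalPtr(NULL, R_NilValue, R_NilValue));
    R_PreserveObject(s);    /* keep it alive for the whole session */
    /* onexit = TRUE: run the finalizer at R shutdown even though the
       preserved object never becomes garbage. */
    R_RegisterCFinalizerEx(s, cleanup_resource, TRUE);
    UNPROTECT(1);
    return s;
}

The final TRUE argument to R_RegisterCFinalizerEx() asks R to run the
finalizer at shutdown even if the object is still reachable, which is
what makes the cleanup run when R quits.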