Jeroen Ooms
2024-Mar-02 14:07 UTC
[Rd] Big speedup in install.packages() by re-using connections
Currently download.file() creates and terminates a new TLS connection for each download. This creates a lot of overhead, which is expensive for both client and server (in particular the TLS handshake). Modern internet clients (including browsers) re-use connections for many HTTP requests.

We can do this in R by creating a persistent libcurl "multi-handle". The R libcurl implementation already uses a multi-handle, but it destroys it after each download, which defeats the purpose: the point of the multi-handle is to keep it alive and let libcurl maintain a persistent connection pool. This is particularly relevant for install.packages(), which needs to download many files from one and the same server.

Here is a bare-minimum proof-of-concept patch that re-uses one and the same multi-handle for all requests in R: https://github.com/r-devel/r-svn/pull/155/files

Some quick benchmarking shows that this can lead to big speedups for download.packages() on high-bandwidth servers (such as CI). This quick test, which downloads 100 packages from CRAN, showed more than a 10x speedup for me: https://github.com/r-devel/r-svn/pull/155

Moreover, I think this may make install.packages() more robust. In CI build logs that download many packages, I often see one or two downloads randomly failing with a TLS-connect error. I am hopeful this problem will disappear when a single connection to the CRAN server is used to download all the packages.
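The difference between opening a connection per file and keeping one alive can be illustrated outside R. The following is a minimal sketch in Python (standard library only), not the libcurl code from the patch: it starts a throwaway HTTP server on localhost and counts how many connections the server accepts when the client downloads several files with and without re-use. The names `CountingHandler`, `count_connections` and the `/pkg_N.tar.gz` paths are hypothetical, and the demo uses plain HTTP, so it shows only the number of connections, not the cost of a TLS handshake.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves a tiny body and counts accepted connections."""

    protocol_version = "HTTP/1.1"  # lets the server keep a connection open
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted connection, not once per request.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"fake package contents"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(n_files, reuse):
    """Download n_files from a local server; return connections it accepted."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    shared = http.client.HTTPConnection(host, port) if reuse else None
    try:
        for i in range(n_files):
            conn = shared if reuse else http.client.HTTPConnection(host, port)
            conn.request("GET", f"/pkg_{i}.tar.gz")
            conn.getresponse().read()  # drain the body before the next request
            if not reuse:
                conn.close()  # mimics tearing the connection down per download
    finally:
        if shared is not None:
            shared.close()
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("new connection per file:", count_connections(5, reuse=False))
    print("one persistent connection:", count_connections(5, reuse=True))
```

With TLS, every one of those extra connections would also pay for a handshake, which is the overhead a persistent multi-handle is meant to avoid.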
Jeroen Ooms
2024-Apr-25 12:45 UTC
[Rd] Big speedup in install.packages() by re-using connections
I'd like to raise this again now that 4.4 is out.

Below is a more complete patch, which includes a function to properly clean up libcurl when R quits. Implementing this is a little tricky because libcurl is a separate "module" in R. Perhaps there is a better way, but this works:

view: https://github.com/r-devel/r-svn/pull/166/files
patch: https://github.com/r-devel/r-svn/pull/166.diff

The old patch is still there as well; it is meant as a minimal proof of concept to test the performance gains from re-using the connection:

view: https://github.com/r-devel/r-svn/pull/155/files
patch: https://github.com/r-devel/r-svn/pull/155.diff

Performance gains are greatest on high-bandwidth servers when downloading many files from the same server (e.g. packages from a CRAN mirror). In such cases, over 90% of the total time is currently spent on initiating and tearing down a separate SSL connection for each file download.

Thoughts?
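As a rough sanity check, the "over 90%" figure and the "more than 10x" speedup are consistent with each other. The sketch below is a simple idealised model, not a measurement: it assumes sequential downloads, an equal connection cost per file, and that re-use leaves exactly one connection setup for the whole batch. The symbols f (fraction of total time spent on connection setup and teardown), n (number of files) and S (speedup) are introduced here for illustration only.

```latex
S(f, n) = \frac{1}{(1 - f) + \dfrac{f}{n}},
\qquad
\lim_{n \to \infty} S(f, n) = \frac{1}{1 - f}

% n = 100 files:
%   f = 0.90  =>  S = 1 / (0.10 + 0.009)   \approx  9.2
%   f = 0.95  =>  S = 1 / (0.05 + 0.0095)  \approx 16.8
% S = 10 is reached at 0.99 f = 0.9, i.e. f \approx 0.91
```

Under these assumptions, a 10x speedup for 100 files needs only slightly more than 90% of the time to be connection overhead.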