Jeroen Ooms
2024-Sep-08 21:14 UTC
[Rd] Big speedup in install.packages() by re-using connections
On Mon, Sep 2, 2024 at 10:05 AM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>
> On 4/25/24 17:01, Ivan Krylov via R-devel wrote:
> > On Thu, 25 Apr 2024 14:45:04 +0200
> > Jeroen Ooms <jeroenooms at gmail.com> wrote:
> >
> >> Thoughts?
> > How verboten would it be to create an empty external pointer object,
> > add it to the preserved list, and set an on-exit finalizer to clean up
> > the curl multi-handle? As far as I can tell, the internet module is not
> > supposed to be unloaded, so this would not introduce an opportunity to
> > jump to an unmapped address. This makes it possible to avoid adding a
> > CurlCleanup() function to the internet module:
>
> Cleaning up this way in principle would probably be fine, but R already
> has support for re-using connections. Even more, R can download files in
> parallel (in a single thread), which particularly helps with bigger
> latencies (e.g. typically users connecting from home, etc). See
> ?download.file(), look for "simultaneous".

Thank you for looking at this. A few ideas wrt parallel downloading:

Additional improvement on Windows can be achieved by enabling the
nghttp2 driver in libcurl in rtools, such that it takes advantage of
http2 multiplexing for parallel downloads
(https://bugs.r-project.org/show_bug.cgi?id=18664).

Moreover, one concern is that install.packages() may fail more
frequently on low bandwidth connections due to reaching the "download
timeout" when downloading files in parallel:

R has an unusual definition of the http timeout, which by default
aborts in-progress downloads after 60 seconds for no obvious reason.
(by contrast, browsers enforce a timeout on unresponsive/stalled
downloads only, which can be achieved in libcurl by setting
CURLOPT_CONNECTTIMEOUT or CURLOPT_LOW_SPEED_TIME).

The above is already a problem on slow networks, where large packages
can fail to install with a timeout error in the download stage. Users
may assume there must be a problem with the network, as it is not
obvious that machines on slower internet connection need to work
around R's defaults and modify options(timeout) before
install.packages(). This problem could become more prevalent when
using parallel downloads while still enforcing the same total timeout.

For example: the MacOS binary for package "sf" is close to 90mb, hence
currently, under the default R settings of options(timeout=60),
install.packages will error with a download timeout on clients with
less than 1.5MB/s bandwidth. But with the parallel implementation,
install.packages() will share the bandwidth on 6 parallel downloads,
so if "sf" is downloaded with all its dependencies, we need at least
9MB/s (i.e. a 100mbit connection) for the default settings to not
cause a timeout.

Hopefully this can be revised to enforce the timeout on stalled
downloads only, as is common practice.
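The bandwidth figures in the "sf" example can be reproduced with a few lines of arithmetic. The Python sketch below is only an illustration: the function name is hypothetical, it assumes the link is split evenly across all parallel downloads for the whole timeout window (pessimistic, since smaller files finish early and free up bandwidth), and it ignores protocol overhead.

```python
def min_bandwidth_mb_per_s(size_mb, timeout_s, parallel_downloads=1):
    """Smallest total link bandwidth (MB/s) at which one file of size_mb
    finishes within timeout_s, assuming the link is shared evenly among
    parallel_downloads transfers (an assumption, not measured behaviour)."""
    if timeout_s <= 0 or parallel_downloads < 1:
        raise ValueError("timeout_s must be > 0 and parallel_downloads >= 1")
    return size_mb * parallel_downloads / timeout_s


# Figures taken from the message: ~90 MB binary, 60 s timeout, 6 downloads.
sequential = min_bandwidth_mb_per_s(90, 60)        # one download at a time
parallel = min_bandwidth_mb_per_s(90, 60, 6)       # six downloads sharing the link
parallel_mbit = parallel * 8                       # MB/s -> Mbit/s

print(sequential, parallel, parallel_mbit)
```

Under these assumptions the result is 1.5 MB/s sequentially and 9 MB/s with six parallel downloads; 9 MB/s corresponds to 72 Mbit/s, so the "100mbit connection" in the message reads as a round figure for the nearest common link speed.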
Tomas Kalibera
2024-Sep-09 09:11 UTC
[Rd] Big speedup in install.packages() by re-using connections
On 9/8/24 23:14, Jeroen Ooms wrote:
> On Mon, Sep 2, 2024 at 10:05 AM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>>
>> On 4/25/24 17:01, Ivan Krylov via R-devel wrote:
>>> On Thu, 25 Apr 2024 14:45:04 +0200
>>> Jeroen Ooms <jeroenooms at gmail.com> wrote:
>>>
>>>> Thoughts?
>>> How verboten would it be to create an empty external pointer object,
>>> add it to the preserved list, and set an on-exit finalizer to clean up
>>> the curl multi-handle? As far as I can tell, the internet module is not
>>> supposed to be unloaded, so this would not introduce an opportunity to
>>> jump to an unmapped address. This makes it possible to avoid adding a
>>> CurlCleanup() function to the internet module:
>> Cleaning up this way in principle would probably be fine, but R already
>> has support for re-using connections. Even more, R can download files in
>> parallel (in a single thread), which particularly helps with bigger
>> latencies (e.g. typically users connecting from home, etc). See
>> ?download.file(), look for "simultaneous".
> Thank you for looking at this. A few ideas wrt parallel downloading:
>
> Additional improvement on Windows can be achieved by enabling the
> nghttp2 driver in libcurl in rtools, such that it takes advantage of
> http2 multiplexing for parallel downloads
> (https://bugs.r-project.org/show_bug.cgi?id=18664).

Anyone who wants to cooperate and help is more than welcome to
contribute patches to upstream MXE. In case of nghttp2, thanks to
Andrew Johnson, who contributed nghttp2 support to upstream MXE. It
will be part of the next Rtools (probably Rtools45).

> Moreover, one concern is that install.packages() may fail more
> frequently on low bandwidth connections due to reaching the "download
> timeout" when downloading files in parallel:
>
> R has an unusual definition of the http timeout, which by default
> aborts in-progress downloads after 60 seconds for no obvious reason.
> (by contrast, browsers enforce a timeout on unresponsive/stalled
> downloads only, which can be achieved in libcurl by setting
> CURLOPT_CONNECTTIMEOUT or CURLOPT_LOW_SPEED_TIME).
>
> The above is already a problem on slow networks, where large packages
> can fail to install with a timeout error in the download stage. Users
> may assume there must be a problem with the network, as it is not
> obvious that machines on slower internet connection need to work
> around R's defaults and modify options(timeout) before
> install.packages(). This problem could become more prevalent when
> using parallel downloads while still enforcing the same total timeout.
>
> For example: the MacOS binary for package "sf" is close to 90mb, hence
> currently, under the default R settings of options(timeout=60),
> install.packages will error with a download timeout on clients with
> less than 1.5MB/s bandwidth. But with the parallel implementation,
> install.packages() will share the bandwidth on 6 parallel downloads,
> so if "sf" is downloaded with all its dependencies, we need at least
> 9MB/s (i.e. a 100mbit connection) for the default settings to not
> cause a timeout.
>
> Hopefully this can be revised to enforce the timeout on stalled
> downloads only, as is common practice.

Yes, this is work in progress, I am aware that the timeout could use
some thought re simultaneous downloads.

If anyone wants to help with testing the current implementation of
simultaneous download and report any bugs found, that would be nice.

Best
Tomas
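The difference between a total timeout and a stall-only timeout, as discussed in the thread, can be illustrated with a toy simulation. The Python sketch below is not R's or libcurl's implementation: the `simulate` function and its parameters are hypothetical, the stall rule is only loosely modelled on the idea behind libcurl's CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME, and throughput is sampled once per second for simplicity.

```python
def simulate(rates, size, total_timeout=None,
             low_speed_limit=None, low_speed_time=None):
    """Replay per-second throughput samples (MB/s) for a file of `size` MB.

    Returns "done", "total-timeout", "stalled", or "incomplete" (samples
    ran out). Hypothetical model for illustration only.
    """
    received = 0.0
    slow_seconds = 0
    for second, rate in enumerate(rates, start=1):
        received += rate
        if received >= size:
            return "done"
        # Total-timeout policy: abort after a fixed wall-clock budget,
        # even if data is still arriving.
        if total_timeout is not None and second >= total_timeout:
            return "total-timeout"
        # Stall policy: abort only after the rate stays below the limit
        # for low_speed_time consecutive seconds.
        if low_speed_limit is not None and low_speed_time is not None:
            slow_seconds = slow_seconds + 1 if rate < low_speed_limit else 0
            if slow_seconds >= low_speed_time:
                return "stalled"
    return "incomplete"


slow_but_steady = [1.0] * 120            # 1 MB/s, never stalls
dead_link = [1.0] * 10 + [0.0] * 60      # stops delivering after 10 s

steady_total = simulate(slow_but_steady, 90, total_timeout=60)
steady_stall = simulate(slow_but_steady, 90,
                        low_speed_limit=0.01, low_speed_time=30)
dead_stall = simulate(dead_link, 90,
                      low_speed_limit=0.01, low_speed_time=30)

print(steady_total, steady_stall, dead_stall)
```

In this model the slow but steady 90 MB download is aborted by the 60-second total timeout yet completes under the stall-only rule, while the dead link is still detected and aborted by the stall rule.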