Joris Meys
2018-May-04 08:00 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
On Fri, May 4, 2018 at 8:34 AM, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

> The current heuristic/hack is in line with the compatibility approach: it
> detects files that are obviously binary, so it changes the default behavior
> only for cases when it would obviously cause damage.
>
> Tomas

Well, I was trying to download a .gz file and download.file() didn't detect that. Reason for that is obviously that the link doesn't contain .gz but %2Egz, using the ASCII code for the dot instead of the dot itself. That's general practice in a lot of links.

Hence I propose to change the line in download.file() that does this check to:

    if (missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", URLdecode(url))))

Using URLdecode() ensures that .gz, .RData etc. will be detected correctly in an encoded URL.

Cheers
Joris

--
Joris Meys
Statistical consultant
Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)
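To make the effect of the proposed change concrete, here is a minimal sketch of the check in isolation. The URL is hypothetical and used only for illustration; the pattern is the one quoted in the message, and this is not the actual body of download.file().

```r
# Extension pattern quoted in the message above.
binary_ext <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"

# Hypothetical URL (assumption, for illustration only): the dot before
# "gz" is percent-encoded as %2E.
url <- "https://example.org/data/archive%2Egz"

# Without decoding, the pattern does not match the encoded URL.
undecoded_match <- grepl(binary_ext, url)

# After decoding, "%2E" becomes "." and the pattern matches.
decoded_match <- grepl(binary_ext, utils::URLdecode(url))

print(c(undecoded = undecoded_match, decoded = decoded_match))
```

The first test fails because the character before "gz" is "E" rather than a literal dot; only the decoded URL ends in ".gz".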
Martin Maechler
2018-May-04 08:18 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
>>>>> Joris Meys <jorismeys at gmail.com>
>>>>>     on Fri, 4 May 2018 10:00:07 +0200 writes:

    > On Fri, May 4, 2018 at 8:34 AM, Tomas Kalibera
    > <tomas.kalibera at gmail.com> wrote:
    >> The current heuristic/hack is in line with the
    >> compatibility approach: it detects files that are
    >> obviously binary, so it changes the default behavior only
    >> for cases when it would obviously cause damage.
    >>
    >> Tomas

    > Well, I was trying to download a .gz file and
    > download.file() didn't detect that. Reason for that is
    > obviously that the link doesn't contain .gz but %2Egz,
    > using the ASCII code for the dot instead of the dot
    > itself. That's general practice in a lot of links.

    > Hence I propose to change the line in download.file() that
    > does this check to:

    >     if (missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$",
    >                                      URLdecode(url))))

    > using URLdecode() ensures that .gz, .RData etc will be
    > detected correctly in an encoded URL.

    > Cheers Joris

Makes sense to me and I plan to add it when also adding '.rds'

{ OTOH, after reading the thread about this: Shouldn't you make
  your code more robust and use mode = "wb" (or "ab") in any case?
  ;-)
}

Martin
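The suggestion to always pass 'mode' explicitly could look roughly like the sketch below. The wrapper name download_binary() is hypothetical and not part of R; it only forwards to utils::download.file() with mode = "wb", so the extension heuristic is never consulted.

```r
# Hypothetical convenience wrapper (assumption, not part of base R):
# always download in binary mode, regardless of platform or file extension.
download_binary <- function(url, destfile, ...) {
  # mode = "wb" writes the bytes unchanged; on Windows this avoids the
  # line-ending translation that text mode ("w") applies.
  utils::download.file(url, destfile = destfile, mode = "wb", ...)
}
```

A call such as download_binary(url, "archive.gz") would then behave the same way on all platforms, at the cost of the caller having to opt in.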
Henrik Bengtsson
2018-May-07 00:28 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
Thanks for the comments, feedback, and improvements.

I still argue that the current behavior causes more harm than good. First of all, it increases the risk of code that does not work on all platforms, whereas cross-platform portability is, I'd say, one of the strengths and design goals of R. To write cross-platform code, a developer basically needs to specify argument 'mode'.

A second problem is that people who work on non-Windows platforms will not be aware of this problem. Yes, adding this Windows-specific behavior to the help on all platforms will help a bit (thanks for doing that). However, since there are so many non-Windows users out there who write documentation, vignettes, and blog posts, and host classes and workshops, it is quite likely that you'll see things like "Download the data file using `download.file(url, file)` and then ...". Boom, a "beginner" on Windows will have problems, and even the non-Windows instructor may not know what's going on, and quickly lots of time is wasted.

A third problem is wasted bandwidth, because the same file has to be downloaded a second time. If the default is changed to mode="wb" and someone truly needs mode="w", the penalty should be smaller because such text-based files are likely to be much smaller than binary files, which are often several GiB these days.

What could lower the risk for the above, and help the user and helpers, is to give an informative warning whenever 'mode' is not specified, e.g.

    The file 'NNN' is downloaded as a text file (mode = "w"). If you
    meant to download it as a binary file, specify mode = "wb".

Deprecating the default mode="w" on Windows can be done in steps, e.g. by making the argument mandatory for a while. This could be done on all platforms because we're already all affected, i.e. we need to specify 'mode' to avoid surprises.

Even if the default won't change, below are some more comments/observations that are related to the current implementation of download.file() on Windows:

ADD MORE EXTENSIONS?
What about case-insensitive matching, e.g. data.ZIP and data.Rdata? A quick scan of the R source code suggests that R is also working with the following filename extensions (using various case styles):

* Rbin (src/library/tools/R/install.R)
* rda, Rda (tests/reg-tests-1a.R)
* rdb (src/library/tools/R/install.R)
* rds, RDS, Rds (src/library/tools/R/install.R)
* rdx (src/library/tools/R/install.R)
* RData, Rdata, rdata (src/library/tools/R/install.R)

Should the tar extension also be added? What about binary image formats that R produces, e.g. filename extensions bmp, jpg, jpeg, pdf, png, tif, tiff? What about all the other file extensions that we know for sure are binary?

VECTORIZATION:

For some values of the 'method' argument, the current implementation will download the same file differently depending on which other files are downloaded at the same time. For example, here a PNG file is downloaded in text mode and its content is translated:

  > urls <- c("https://www.r-project.org/logo/Rlogo.png")
  > download.file(urls, destfile = basename(urls), method = "libcurl")
  trying URL 'https://www.r-project.org/logo/Rlogo.png'
  Content length 48148 bytes (47 KB)
  downloaded 47 KB
  > file.size(basename(urls))
  [1] 48281

But if we throw in a "known" binary extension, the PNG file is downloaded as binary:

  > urls <- c("https://www.r-project.org/logo/Rlogo.png",
  +           "https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip")
  > download.file(urls, destfile = basename(urls), method = "libcurl")
  trying URL 'https://www.r-project.org/logo/Rlogo.png'
  trying URL 'https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip'
  > file.size(basename(urls))
  [1]  48148 527069

Best,

Henrik
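As a rough illustration of the two points above (case-insensitive matching and per-file decisions), here is a sketch of a helper that picks a mode for each URL independently. The names guess_mode() and download_each() are hypothetical, the extension list is only an example, and the example.org URLs are made up; none of this is how download.file() is implemented.

```r
# Hypothetical helper (assumption): decide the mode for each URL on its own,
# after decoding percent-escapes and ignoring the case of the extension.
guess_mode <- function(urls,
                       ext = c("gz", "bz2", "xz", "tgz", "zip",
                               "rda", "rds", "RData")) {
  pattern <- paste0("\\.(", paste(ext, collapse = "|"), ")$")
  # Decode one URL at a time so the helper does not rely on URLdecode()
  # accepting a vector.
  decoded <- vapply(urls, utils::URLdecode, character(1), USE.NAMES = FALSE)
  ifelse(grepl(pattern, decoded, ignore.case = TRUE), "wb", "w")
}

# Hypothetical helper (assumption): download files one by one, so the mode
# used for one file never depends on the other files in the vector.
download_each <- function(urls, destfiles = basename(urls)) {
  modes <- guess_mode(urls)
  for (i in seq_along(urls)) {
    utils::download.file(urls[i], destfile = destfiles[i], mode = modes[i])
  }
  invisible(destfiles)
}

# Made-up URLs, for illustration only.
example_urls <- c("https://example.org/data.ZIP",
                  "https://example.org/notes.txt",
                  "https://example.org/x%2ERdata")
example_modes <- guess_mode(example_urls)
print(example_modes)
```

Any extension list of this kind will stay incomplete (a .png URL would still get "w" here), which is the argument for simply passing mode = "wb" everywhere instead.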