Henrik Bengtsson
2018-May-07 00:28 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
Thanks for the comments, feedback, and improvements.

I still argue that the current behavior causes more harm than it helps.

First of all, it increases the risk of code that does not work on all platforms, even though working on all platforms is, I'd say, one of the strengths and design goals of R. To write cross-platform code, a developer basically needs to specify argument 'mode'.

A second problem is that people who work on non-Windows platforms will not be aware of this problem. Yes, adding this Windows-specific behavior to the help on all platforms will help a bit (thanks for doing that). However, since there are so many non-Windows users out there who write documentation, vignettes, and blog posts, and who host classes and workshops, it is quite likely that you'll see things like "Download the data file using `download.file(url, file)` and then ...". Boom, a "beginner" on Windows will have problems, and even the non-Windows instructor may not know what's going on, and quickly lots of time is wasted.

A third problem is wasted bandwidth, because the same file has to be downloaded a second time. If the default is changed to mode="wb" and someone truly needs mode="w", the penalty should be smaller because such text-based files are likely to be much smaller than binary files, which are often several GiB these days.

What could lower the risk for the above, and help the user and helpers, is to give an informative warning whenever 'mode' is not specified, e.g.

  The file 'NNN' is downloaded as a text file (mode = "w"). If you meant to download it as a binary file, specify mode = "wb".

Deprecating the default mode="w" on Windows can be done in steps, e.g. by making the argument mandatory for a while. This could be done on all platforms because we're already all affected, i.e. we need to specify 'mode' to avoid surprises.

Even if the default won't change, below are some more comments/observations that are related to the current implementation of download.file() on Windows:

ADD MORE EXTENSIONS?
What about case-insensitive matching, e.g. data.ZIP and data.Rdata?

A quick scan of the R source code suggests that R is also working with the following filename extensions (using various case styles):

* Rbin (src/library/tools/R/install.R)
* rda, Rda (tests/reg-tests-1a.R)
* rdb (src/library/tools/R/install.R)
* rds, RDS, Rds (src/library/tools/R/install.R)
* rdx (src/library/tools/R/install.R)
* RData, Rdata, rdata (src/library/tools/R/install.R)

Should the tar extension also be added?

What about binary image formats that R produces, e.g. filename extensions bmp, jpg, jpeg, pdf, png, tif, tiff?

What about all the other file extensions that we know for sure are binary?

VECTORIZATION:

For some values of the 'method' argument, the current implementation will download the same file differently depending on other files downloaded at the same time. For example, here a PNG file is downloaded in text mode and its content is translated:

> urls <- c("https://www.r-project.org/logo/Rlogo.png")
> download.file(urls, destfile = basename(urls), method = "libcurl")
trying URL 'https://www.r-project.org/logo/Rlogo.png'
Content length 48148 bytes (47 KB)
downloaded 47 KB
> file.size(basename(urls))
[1] 48281

But if we throw in a "known" binary extension, the PNG file is downloaded as binary:

> urls <- c("https://www.r-project.org/logo/Rlogo.png", "https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip")
> download.file(urls, destfile = basename(urls), method = "libcurl")
trying URL 'https://www.r-project.org/logo/Rlogo.png'
trying URL 'https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip'
> file.size(basename(urls))
[1] 48148 527069

Best,

Henrik

On Fri, May 4, 2018 at 1:18 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>>>>> Joris Meys <jorismeys at gmail.com>
>>>>>> on Fri, 4 May 2018 10:00:07 +0200 writes:
>
> > On Fri, May 4, 2018 at 8:34 AM, Tomas Kalibera
> > <tomas.kalibera at gmail.com> wrote:
>
> >> The current heuristic/hack is in line with the
> >> compatibility approach: it detects files that are
> >> obviously binary, so it changes the default behavior only
> >> for cases when it would obviously cause damage.
> >>
> >> Tomas
>
> > Well, I was trying to download a .gz file and
> > download.file() didn't detect that. Reason for that is
> > obviously that the link doesn't contain .gz but %2Egz ,
> > using the ASCII code for the dot instead of the dot
> > itself. That's general practice in a lot of links.
>
> > Hence I propose to change the line in download.file() that
> > does this check to:
>
> > if (missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$",
> > URLdecode(url))))
>
> > using URLdecode() ensures that .gz, .RData etc will be
> > detected correctly in an encoded URL.
>
> > Cheers Joris
>
> Makes sense to me and I plan to add it when also adding '.rds'
>
> { OTOH, after reading the thread about this: Shouldn't you make
> your code more robust and use mode = "wb" (or "ab") in any case?
> ;-)
> }
>
> Martin
>
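The extension check discussed above can be sketched as a standalone helper that combines URL decoding, case-insensitive matching, and a per-URL result. This is only an illustration under stated assumptions: `looks_binary()` and its extension list are hypothetical and are not the code inside download.file(). Because a test of the form length(grep(...)) is true as soon as any element of the vector matches, the sketch uses grepl() so that each URL is judged on its own.

```r
# Hypothetical helper (not part of base R): guess from the decoded filename
# extension whether each URL points to a binary file.
looks_binary <- function(url,
                         exts = c("gz", "bz2", "xz", "tgz", "zip",
                                  "rda", "rds", "RData")) {
  # Build a pattern such as "\\.(gz|bz2|...)$" from the extension list
  pattern <- paste0("\\.(", paste(exts, collapse = "|"), ")$")
  # Decode one URL at a time so that "%2Egz" becomes ".gz"
  decoded <- vapply(url, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # grepl() returns one logical per URL; ignore.case covers ".ZIP", ".Rdata"
  grepl(pattern, decoded, ignore.case = TRUE)
}

looks_binary("https://example.org/data%2Egz")   # TRUE
looks_binary("https://example.org/data.ZIP")    # TRUE
looks_binary(c("https://example.org/Rlogo.png",
               "https://example.org/pkg.zip"))  # FALSE TRUE
```

Independently of any heuristic, passing mode = "wb" explicitly to download.file() remains the cross-platform way to fetch binary files.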
Joris Meys
2018-May-07 08:49 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
Martin, also from me a heartfelt thank you for taking care of this. Some thoughts on Henrik's response:

On Mon, May 7, 2018 at 2:28 AM, Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote:

> I still argue that the current behavior causes more harm than it helps.

I agree with your analysis of the problems this legacy behaviour causes.

> Deprecating the default mode="w" on Windows can be done in steps, e.g.
> by making the argument mandatory for a while. This could be done on
> all platforms because we're already all affected, i.e. we need to
> specify 'mode' to avoid surprises.

That sounds like a reasonable way to move away from this discrepancy between operating systems.

> What about case-insensitive matching, e.g. data.ZIP and data.Rdata?

Totally agree, and easily solved by, e.g., adding ignore.case = TRUE to the grep() call.

> A quick scan of the R source code suggests that R is also working with
> the following filename extensions (using various case styles):
>
> What about all the other file extensions that we know for sure are binary?

If the default isn't changed, doesn't it make more sense to actually turn the logic around? Text files that are downloaded over the internet are almost always .txt, .csv, or a few other extensions used for text data. Those are actually the only files where some people with very old Windows programs for text processing can get into trouble. So instead of adding every possible binary extension, one can put "wb" as the default and change to "w" if it is a text file, instead of the other way around. That would not change the concept of the behaviour, but it ensures that the function doesn't fail to detect a binary file. Not detecting a text file is far less of a problem, as not converting the line endings doesn't destroy the file.
Cheers
Joris

--
Joris Meys
Statistical consultant
Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)
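The inverted logic proposed in this message can be sketched as follows: binary is the default, and text mode is chosen only for extensions known to hold text. The helper name `guess_mode()` and the extension list are assumptions made for illustration (".tsv" in particular is not named in the thread), and nothing here reflects how download.file() is actually implemented.

```r
# Hypothetical helper: default to "wb" and fall back to "w" only when the
# filename extension is on a short list of known text formats.
guess_mode <- function(url, text_exts = c("txt", "csv", "tsv")) {
  # Pattern such as "\\.(txt|csv|tsv)$", matched case-insensitively
  pattern <- paste0("\\.(", paste(text_exts, collapse = "|"), ")$")
  # One mode per URL: "w" for known text files, "wb" for everything else
  ifelse(grepl(pattern, url, ignore.case = TRUE), "w", "wb")
}

guess_mode(c("https://example.org/table.csv",
             "https://example.org/Rlogo.png"))  # "w" "wb"
```

With this design, an unrecognised extension costs at most an unconverted line ending, whereas the current approach risks a corrupted binary file.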
Hugh Parsonage
2018-May-07 12:32 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
I'd add my support for mode = "wb" to (eventually) become the default, though I respect Tomas's comments about backwards-compatibility.

Instead of making the argument mandatory (which would immediately break scripts, even ones that won't be helped by changing to mode "wb") or otherwise changing behaviour, perhaps download.file() could start to emit a message (not a warning) whenever the argument is missing on Windows. The message could say something like:

  Using `mode = "w"`, which will corrupt non-text files. Set `mode = "wb"` for binary downloads or see the help page for other options.

Emitting a message has the lightest impact on existing scripts, while alerting new users to future mistakes.
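A message of this kind could be prototyped outside of R itself with a thin wrapper. The sketch below is only that: `resolve_mode()` and `download_with_notice()` are hypothetical names, the wrapper uses `mode = NULL` to stand in for a missing argument, and no released version of download.file() is claimed to behave this way.

```r
# Hypothetical helper: pick the mode and, on Windows, tell the user when the
# text-mode default is being used because 'mode' was not supplied.
resolve_mode <- function(mode = NULL, os_type = .Platform$OS.type) {
  if (is.null(mode)) {
    mode <- "w"  # the current default discussed in this thread
    if (identical(os_type, "windows")) {
      message("Using mode = \"w\", which will corrupt non-text files. ",
              "Set mode = \"wb\" for binary downloads or see the help page ",
              "for other options.")
    }
  }
  mode
}

# Hypothetical wrapper around utils::download.file() that applies the helper
download_with_notice <- function(url, destfile, mode = NULL, ...) {
  utils::download.file(url, destfile, mode = resolve_mode(mode), ...)
}
```

Because message() does not interrupt execution and can be silenced with suppressMessages(), existing scripts keep running unchanged, which is the property argued for above.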