thr3ads.net - R devel - [Rd] download.file does not process gz files correctly (truncates them?) [May 2018]

Henrik Bengtsson

2018-May-07 00:28 UTC

[Rd] download.file does not process gz files correctly (truncates them?)

Thanks for the comments, feedback, and improvements.

I still argue that the current behavior causes more harm than good.

First of all, it increases the risk of code that does not work on all
platforms, whereas cross-platform portability is, I'd say, one of the
strengths and design goals of R.  To write cross-platform code, a
developer basically needs to specify the 'mode' argument.

A second problem is that people who work on non-Windows platforms will
not be aware of this problem.  Yes, adding this Windows-specific
behavior to the help on all platforms will help a bit (thanks for
doing that).  However, since there are so many non-Windows users out
there who write documentation, vignettes, and blog posts, and who host
classes and workshops, it is quite likely that you'll see things like
"Download the data file using `download.file(url, file)` and then
...".  Boom, a "beginner" on Windows will have problems, and even the
non-Windows instructor may not know what's going on, and quickly lots
of time is wasted.

A third problem is wasted bandwidth because the same file has to be
downloaded a second time.  If the default is changed to mode="wb" and
someone truly needs mode="w", the penalty should be smaller because
such text-based files are likely to be much smaller than binary files,
which are often several GiB these days.

What could lower the risk of the above, and help the user and helpers,
is to give an informative warning whenever 'mode' is not specified,
e.g.

   The file 'NNN' is downloaded as a text file (mode = "w").  If you
   meant to download it as a binary file, specify mode = "wb".
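
A minimal sketch of such a warning, written as a wrapper around
download.file() rather than as a change to the function itself.  The
name download_file_warn() is hypothetical, and the sketch warns on all
platforms; it only illustrates the idea and is not how R implements
anything.

```r
# Hypothetical wrapper (not part of R): warn whenever 'mode' is left
# unspecified, then fall back to the text-mode default mode = "w".
download_file_warn <- function(url, destfile, mode = "w", ...) {
  if (missing(mode)) {
    warning(sprintf(
      paste("The file '%s' is downloaded as a text file (mode = \"w\").",
            "If you meant to download it as a binary file,",
            "specify mode = \"wb\"."),
      destfile), call. = FALSE)
  }
  utils::download.file(url, destfile, mode = mode, ...)
}
```

Calling download_file_warn(url, destfile) would then warn, while
download_file_warn(url, destfile, mode = "wb") would stay silent.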

Deprecating the default mode="w" on Windows can be done in steps, e.g.
by making the argument mandatory for a while. This could be done on
all platforms because we're already all affected, i.e. we need to
specify 'mode' to avoid surprises.

Even if the default won't change, below are some more
comments/observations that are related to the current implementation of
download.file() on Windows:

ADD MORE EXTENSIONS?

What about case-insensitive matching, e.g. data.ZIP and data.Rdata?

A quick scan of the R source code suggests that R is also working with
the following filename extensions (using various case styles):

* Rbin (src/library/tools/R/install.R)
* rda, Rda (tests/reg-tests-1a.R)
* rdb (src/library/tools/R/install.R)
* rds, RDS, Rds (src/library/tools/R/install.R)
* rdx (src/library/tools/R/install.R)
* RData, Rdata, rdata (src/library/tools/R/install.R)

Should the tar extension also be added?

What about binary image formats that R produces, e.g. filename
extensions bmp, jpg, jpeg, pdf, png, tif, tiff?

What about all the other file extensions that we know for sure are binary?
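
To make the questions above concrete, here is a rough sketch of a
case-insensitive extension check.  The helper looks_binary() and the
exact list of extensions are assumptions for illustration only; they
are not what download.file() uses.

```r
# Extensions mentioned in this thread plus the image formats listed above;
# the list is illustrative, not the one used by download.file().
binary_ext <- c("gz", "bz2", "xz", "tgz", "zip", "tar",
                "rda", "rds", "rdb", "rdx", "rdata", "rbin",
                "bmp", "jpg", "jpeg", "pdf", "png", "tif", "tiff")

# Build one anchored pattern of the form "\\.(gz|bz2|...)$".
binary_pattern <- paste0("\\.(", paste(binary_ext, collapse = "|"), ")$")

# Hypothetical helper: TRUE for each name that ends in a listed extension,
# ignoring case so that data.ZIP and data.Rdata are matched as well.
looks_binary <- function(url) {
  grepl(binary_pattern, url, ignore.case = TRUE)
}

looks_binary(c("data.ZIP", "data.Rdata", "notes.txt"))
```

Keeping the extensions in a character vector means a new one can be
added without touching the regular expression by hand.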

VECTORIZATION:

For some values of the 'method' argument, the current implementation
will download the same file differently depending on other files
downloaded at the same time.  For example, here a PNG file is
downloaded in text mode and its content is translated:

> urls <- c("https://www.r-project.org/logo/Rlogo.png")
> download.file(urls, destfile = basename(urls), method = "libcurl")
trying URL 'https://www.r-project.org/logo/Rlogo.png'
Content length 48148 bytes (47 KB)
downloaded 47 KB
> file.size(basename(urls))
[1] 48281

But if we throw in a "known" binary extension, the PNG file will be
downloaded as binary:

> urls <- c("https://www.r-project.org/logo/Rlogo.png",
    "https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip")
> download.file(urls, destfile = basename(urls), method = "libcurl")
trying URL 'https://www.r-project.org/logo/Rlogo.png'
trying URL 'https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip'
> file.size(basename(urls))
[1]  48148 527069
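
One way to see how this can happen is to compare a single test over
the whole URL vector with a test per URL.  The sketch below assumes
the check has the form quoted further down in this thread,
length(grep(pattern, url)); the actual implementation may differ.

```r
# Pattern from the check quoted further down in this thread.
pattern <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"

urls <- c("https://www.r-project.org/logo/Rlogo.png",
          "https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip")

# One decision for the whole vector: TRUE as soon as any URL matches,
# so the PNG is treated as binary only because the zip file is present.
any_binary <- length(grep(pattern, urls)) > 0

# One decision per URL: the PNG no longer depends on its neighbours.
per_url_binary <- grepl(pattern, urls)
```

With grepl() the result has one element per URL, which is what a
per-file choice of 'mode' would need.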

Best,

Henrik

On Fri, May 4, 2018 at 1:18 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
>>>>>> Joris Meys <jorismeys at gmail.com>
>>>>>>     on Fri, 4 May 2018 10:00:07 +0200 writes:
>
>     > On Fri, May 4, 2018 at 8:34 AM, Tomas Kalibera
>     > <tomas.kalibera at gmail.com> wrote:
>
>     >> The current heuristic/hack is in line with the
>     >> compatibility approach: it detects files that are
>     >> obviously binary, so it changes the default behavior only
>     >> for cases when it would obviously cause damage.
>     >>
>     >> Tomas
>
>
>     > Well, I was trying to download a .gz file and
>     > download.file() didn't detect that. Reason for that is
>     > obviously that the link doesn't contain .gz but %2Egz ,
>     > using the ASCII code for the dot instead of the dot
>     > itself. That's general practice in a lot of links.
>
>     > Hence I propose to change the line in download.file() that
>     > does this check to:
>
>     >   if (missing(mode) &&
>     >       length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$",
>     >                   URLdecode(url))))
>
>     > using URLdecode() ensures that .gz, .RData etc will be
>     > detected correctly in an encoded URL.
>
>     > Cheers Joris
>
> Makes sense to me and I plan to add it when also adding '.rds'
>
> { OTOH, after reading the thread about this: Shouldn't you make
>   your code more robust and use   mode = "wb" (or "ab") in any case?
>   ;-)
> }
>
> Martin
>
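
The effect of the URLdecode() change proposed in the quoted message
can be checked in isolation.  The URL below is a made-up example; only
the pattern and the use of URLdecode() come from the proposal.

```r
pattern <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"

# Hypothetical URL in which the dot before "gz" is percent-encoded as %2E.
url <- "https://example.org/files/data%2Egz"

encoded_match <- grepl(pattern, url)                    # encoded form
decoded_match <- grepl(pattern, utils::URLdecode(url))  # decoded form
```

Only the decoded form ends in a literal ".gz", so only decoded_match
is TRUE.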

Joris Meys

2018-May-07 08:49 UTC

[Rd] download.file does not process gz files correctly (truncates them?)

Martin, also from me a heartfelt thank you for taking care of this. Some
thoughts on Henrik's response:

On Mon, May 7, 2018 at 2:28 AM, Henrik Bengtsson
<henrik.bengtsson at gmail.com> wrote:
>
> I still argue that the current behavior cause more harm than it helps.
>
I agree with your analysis of the problems this legacy behaviour causes.

> Deprecating the default mode="w" on Windows can be done in steps, e.g.
> by making the argument mandatory for a while. This could be done on
> all platforms because we're already all affected, i.e. we need to
> specify 'mode' to avoid surprises.
>
That sounds like a reasonable way to move away from this discrepancy
between operating systems.

> What about case-insensitive matching, e.g. data.ZIP and data.Rdata?
>
Totally agree, and easily solved by e.g. adding ignore.case = TRUE to the
grep() call.

> A quick scan of the R source code suggests that R is also working with
> the following filename extensions (using various case styles):
>
> What about all the other file extensions that we know for sure are binary?
>
If the default isn't changed, doesn't it make more sense to actually turn
the logic around? Text files that are downloaded over the internet are
almost always .txt, .csv, or a few other extensions used for text data.
Those are actually the only files where some people with very old Windows
programs for text processing can get into trouble. So instead of adding
every possible binary extension, one can make "wb" the default and change
to "w" if it is a text file, instead of the other way around. That would
not change the concept of the behaviour, but ensures that the function
doesn't fail to detect a binary file. Not detecting a text file is far
less of a problem, as not converting the line endings doesn't corrupt the
file.
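
A minimal sketch of that inverted logic, assuming a small list of text
extensions.  The helper default_mode() is hypothetical, and which
extensions count as text (here .tsv was added as a guess) is not
settled in this thread.

```r
# Illustrative list of text extensions; an assumption, not a decided set.
text_ext <- c("txt", "csv", "tsv")
text_pattern <- paste0("\\.(", paste(text_ext, collapse = "|"), ")$")

# Hypothetical helper: binary unless the decoded URL looks like a text file.
default_mode <- function(url) {
  is_text <- grepl(text_pattern, utils::URLdecode(url), ignore.case = TRUE)
  if (is_text) "w" else "wb"
}

default_mode("https://example.org/data.csv")
default_mode("https://example.org/data.unknown")
```

An unknown extension then falls through to "wb", so a missed text file
only keeps its original line endings instead of a binary file being
damaged.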

Cheers
Joris

-- 
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

-----------
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Hugh Parsonage

2018-May-07 12:32 UTC

[Rd] download.file does not process gz files correctly (truncates them?)

I'd add my support for mode = "wb" to (eventually) become the default,
though I respect Tomas's comments about backwards-compatibility.

Instead of making the argument mandatory (which would immediately
break scripts -- even ones that won't be helped by changing to mode
'wb') or otherwise changing behaviour, perhaps download.file could
start to emit a message (not a warning) whenever the argument is
missing on Windows. The message could say something like 'Using `mode
= 'w'` which will corrupt non-text files. Set `mode = 'wb'` for binary
downloads or see the help page for other options.' Emitting a message
has the lightest impact on existing scripts, while alerting new users
to future mistakes.

On 7 May 2018 at 18:49, Joris Meys <jorismeys at gmail.com> wrote:
> Martin, also from me a heartfelt thank you for taking care of this. Some
> thoughts on Henrik's response:
>
> On Mon, May 7, 2018 at 2:28 AM, Henrik Bengtsson
> <henrik.bengtsson at gmail.com> wrote:
>
>> I still argue that the current behavior cause more harm than it helps.
>
> I agree with your analysis of the problems this legacy behaviour causes.
>
>> Deprecating the default mode="w" on Windows can be done in steps, e.g.
>> by making the argument mandatory for a while. This could be done on
>> all platforms because we're already all affected, i.e. we need to
>> specify 'mode' to avoid surprises.
>
> That sounds like a reasonable way to move away from this discrepancy
> between OS.
>
>> What about case-insensitive matching, e.g. data.ZIP and data.Rdata?
>
> Totally agree, and easily solved by eg adding ignore.case = TRUE to the
> grep() call.
>
>> A quick scan of the R source code suggests that R is also working with
>> the following filename extensions (using various case styles):
>>
>> What about all the other file extensions that we know for sure are binary?
>
> If the default isn't changed, doesn't it make more sense to actually turn
> the logic around? Text files that are downloaded over the internet are
> almost always .txt, .csv, or a few other extensions used for text data .
> Those are actually the only files where some people with very old Windows
> programs for text processing can get into trouble. So instead of adding
> every possible binary extension, one can put "wb" as default and change to
> "w" if it is a text file instead of the other way around. That would not
> change the concept of the behaviour, but ensures that the function doesn't
> fail to detect a binary file. Not detecting a text file is far less of a
> problem, as not converting the line endings doesn't destruct the file.
>
> Cheers
> Joris
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
