thr3ads.net - R devel - [Rd] download.file does not process gz files correctly (truncates them?) [May 2018]

Joris Meys

2018-May-04 08:00 UTC

[Rd] download.file does not process gz files correctly (truncates them?)

On Fri, May 4, 2018 at 8:34 AM, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> The current heuristic/hack is in line with the compatibility approach: it
> detects files that are obviously binary, so it changes the default behavior
> only for cases when it would obviously cause damage.
>
> Tomas

Well, I was trying to download a .gz file and download.file() didn't detect
that. The reason is that the link doesn't end in .gz but in %2Egz, with the
dot percent-encoded as its ASCII code (%2E) instead of written as the dot
itself. That's common practice in a lot of links.

Hence I propose to change the line in download.file() that does this check
to:

  if (missing(mode) &&
      length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$",
                  URLdecode(url))))

using URLdecode() ensures that .gz, .RData etc will be detected correctly
in an encoded URL.
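
As a minimal sketch of what this changes (the URL below is a made-up
example, not one from this thread), compare the check with and without
decoding:

```r
# Hypothetical percent-encoded URL, for illustration only.
url <- "https://example.org/files/archive%2Egz"
pattern <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"

# Check on the raw URL: the encoded dot hides the extension.
raw_match <- length(grep(pattern, url)) > 0

# Proposed check: decode the URL first, then match the extension.
decoded_match <- length(grep(pattern, utils::URLdecode(url))) > 0
```

Here raw_match is FALSE and decoded_match is TRUE, because URLdecode()
turns %2E back into a literal dot before the pattern is applied.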

Cheers
Joris

-- 
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

Martin Maechler

2018-May-04 08:18 UTC

>>>>> Joris Meys <jorismeys at gmail.com>
>>>>>     on Fri, 4 May 2018 10:00:07 +0200 writes:
    > On Fri, May 4, 2018 at 8:34 AM, Tomas Kalibera
    > <tomas.kalibera at gmail.com> wrote:

    >> The current heuristic/hack is in line with the
    >> compatibility approach: it detects files that are
    >> obviously binary, so it changes the default behavior only
    >> for cases when it would obviously cause damage.
    >> 
    >> Tomas


    > Well, I was trying to download a .gz file and
    > download.file() didn't detect that. Reason for that is
    > obviously that the link doesn't contain .gz but %2Egz ,
    > using the ASCII code for the dot instead of the dot
    > itself. That's general practice in a lot of links.

    > Hence I propose to change the line in download.file() that
    > does this check to:

    >   if (missing(mode) &&
    >       length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$",
    >                   URLdecode(url))))

    > using URLdecode() ensures that .gz, .RData etc will be
    > detected correctly in an encoded URL.

    > Cheers Joris

Makes sense to me and I plan to add it when also adding '.rds'

{ OTOH, after reading the thread about this: Shouldn't you make
  your code more robust and use mode = "wb" (or "ab") in any case?
  ;-)
}
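
A minimal sketch of that advice, assuming a small wrapper of one's own
(the name download_binary is made up here and is not part of R):

```r
# Hypothetical wrapper: always request binary mode, so the result does
# not depend on the platform or on how the URL happens to be spelled.
download_binary <- function(url, destfile = basename(url), ...) {
  utils::download.file(url, destfile = destfile, mode = "wb", ...)
}
```

Calling download_binary(url) then behaves the same way on Windows and
elsewhere, whatever extension the URL ends in.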
 
Martin

Henrik Bengtsson

2018-May-07 00:28 UTC

Thanks for the comments, feedback, and improvements.

I still argue that the current behavior causes more harm than good.

First of all, it increases the risk of code that does not work on all
platforms, even though cross-platform portability is, I'd say, one of
the strengths and design goals of R.  To write cross-platform code, a
developer basically needs to specify argument 'mode'.

A second problem is that people who work on non-Windows platforms will
not be aware of this problem.  Yes, adding this Windows-specific
behavior to the help on all platforms will help a bit (thanks for
doing that).  However, since there are so many non-Windows users out
there that write documentation, vignettes, blog posts, host classes
and workshops, it is quite likely that you'll see things like
"Download the data file using `download.file(url, file)` and then
...".  Boom, a "beginner" on Windows will have problems, and even the
non-Windows instructor may not know what's going on, and quickly lots
of time is wasted.

A third problem is wasted bandwidth because the same file has to be
downloaded a second time.  If the default is changed to mode="wb" and
someone truly needs mode="w", the penalty should be smaller because
such text-based files are likely to be much smaller than binary files,
which are often several GiB these days.

What could lower the risk of the above, and help the user and helpers,
is to give an informative warning whenever 'mode' is not specified,
e.g.

   The file 'NNN' is downloaded as a text file (mode = "w"). If you
   meant to download it as a binary file, specify mode = "wb".
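
As a rough sketch of how such a warning could look from user code (both
function names below are made up for illustration; this is not how
download.file() itself is implemented):

```r
# Hypothetical helper: builds the warning text suggested above.
mode_warning_text <- function(destfile) {
  sprintf(paste0("The file '%s' is downloaded as a text file (mode = \"w\"). ",
                 "If you meant to download it as a binary file, ",
                 "specify mode = \"wb\"."),
          destfile)
}

# Hypothetical wrapper: warns when 'mode' is omitted, then defers to
# download.file() with whatever the caller did or did not specify.
download_file_checked <- function(url, destfile, mode, ...) {
  if (missing(mode)) {
    warning(mode_warning_text(destfile), call. = FALSE)
    return(utils::download.file(url, destfile, ...))
  }
  utils::download.file(url, destfile, mode = mode, ...)
}
```

The sketch warns on every call without 'mode'; a real implementation
would presumably skip the warning where the file is written in binary
mode anyway.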

Deprecating the default mode="w" on Windows can be done in steps, e.g.
by making the argument mandatory for a while. This could be done on
all platforms because we're already all affected, i.e. we need to
specify 'mode' to avoid surprises.

Even if the default won't change, below are some more
comments/observations that are related to the current implementation of
download.file() on Windows:

ADD MORE EXTENSIONS?

What about case-insensitive matching, e.g. data.ZIP and data.Rdata?

A quick scan of the R source code suggests that R is also working with
the following filename extensions (using various case styles):

* Rbin (src/library/tools/R/install.R)
* rda, Rda (tests/reg-tests-1a.R)
* rdb (src/library/tools/R/install.R)
* rds, RDS, Rds (src/library/tools/R/install.R)
* rdx (src/library/tools/R/install.R)
* RData, Rdata, rdata (src/library/tools/R/install.R)

Should the tar extension also be added?

What about binary image formats that R produces, e.g. filename
extensions bmp, jpg, jpeg, pdf, png, tif, tiff?

What about all the other file extensions that we know for sure are binary?
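
To make the questions above concrete, here is a sketch of a
case-insensitive check over a longer extension list.  The helper name
looks_binary and the exact list are assumptions for illustration (the
list combines the extensions in the current pattern with some of those
mentioned above):

```r
# Assumed extension list; matched case-insensitively, so one lower-case
# spelling per extension is enough.
binary_ext <- c("gz", "bz2", "xz", "tgz", "zip", "tar",
                "rda", "rdata", "rds", "rdb", "rdx", "rbin")
pattern <- paste0("\\.(", paste(binary_ext, collapse = "|"), ")$")

# Hypothetical helper: TRUE for each URL whose decoded form ends in one
# of the extensions above, ignoring case.
looks_binary <- function(url) {
  decoded <- vapply(url, function(u) utils::URLdecode(u),
                    character(1), USE.NAMES = FALSE)
  grepl(pattern, decoded, ignore.case = TRUE)
}

res <- looks_binary(c("data.ZIP", "data.Rdata", "notes.txt"))
```

With this, data.ZIP and data.Rdata are both flagged as binary and
notes.txt is not.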

VECTORIZATION:

For some values of the 'method' argument, the current implementation
will download the same file differently depending on which other files
are downloaded at the same time.  For example, here a PNG file is
downloaded in text mode and its content is translated:

> urls <- c("https://www.r-project.org/logo/Rlogo.png")
> download.file(urls, destfile = basename(urls), method = "libcurl")
trying URL 'https://www.r-project.org/logo/Rlogo.png'
Content length 48148 bytes (47 KB)
downloaded 47 KB
> file.size(basename(urls))
[1] 48281
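
The file on disk is 48281 - 48148 = 133 bytes larger than what was
sent.  A sketch of why, assuming the only change is the usual Windows
text-mode translation of each LF byte into CR LF on write (the function
name to_text_mode is made up for illustration):

```r
# Hypothetical simulation of a text-mode write: every LF (0x0a) byte
# becomes CR LF (0x0d 0x0a); all other bytes pass through unchanged.
to_text_mode <- function(bytes) {
  is_lf <- bytes == as.raw(0x0a)
  out <- rep(bytes, times = 1L + is_lf)   # duplicate each LF byte
  end_pos <- cumsum(1L + is_lf)           # where each input byte ends up
  out[end_pos[is_lf] - 1L] <- as.raw(0x0d)  # first copy of each pair -> CR
  out
}

# First eight bytes of any PNG file (the PNG signature); two are LF.
png_signature <- as.raw(c(0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a))
translated <- to_text_mode(png_signature)
```

The translated signature is two bytes longer than the original, one per
LF byte, and is no longer a valid PNG header.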

But if we throw in a "known" binary extension, the PNG file will be
downloaded as binary:

> urls <- c("https://www.r-project.org/logo/Rlogo.png",
            "https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip")
> download.file(urls, destfile = basename(urls), method = "libcurl")
trying URL 'https://www.r-project.org/logo/Rlogo.png'
trying URL 'https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip'
> file.size(basename(urls))
[1]  48148 527069
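
A sketch of why this happens, assuming the extension check quoted
earlier in this thread is applied to the whole 'url' vector at once:

```r
pattern <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"
urls <- c("https://www.r-project.org/logo/Rlogo.png",
          "https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip")

# Batch-level check: non-zero as soon as any one URL matches, so a
# single mode ends up being chosen for every file in the batch.
batch_is_binary <- length(grep(pattern, urls)) > 0

# Per-file check, for comparison: only the .zip URL matches.
per_file_is_binary <- grepl(pattern, urls)
```

batch_is_binary is TRUE while per_file_is_binary is FALSE for the PNG,
which is consistent with the PNG being written in binary mode only when
the .zip URL is in the same call.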

Best,

Henrik
