thr3ads.net - R devel - [Rd] download.file does not process gz files correctly (truncates them?) [May 2018]


Henrik Bengtsson

2018-May-03 21:14 UTC

[Rd] download.file does not process gz files correctly (truncates them?)

Also, as mentioned in my
https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
not specifying the mode argument, the default on Windows is mode = "w"
*except* for certain, case-sensitive, filename extensions:

    if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
        mode <- "wb"

Just like the need for mode = "wb" on Windows, the above
special-file-extension-hack is only happening on Windows, and is only
documented in ?download.file if you're on Windows; so someone who's on
Linux/macOS trying to help someone on Windows may not be aware of
this. This adds to even more confusions, e.g. "works for me".

/Henrik
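
To make the case-sensitivity point concrete, here is a minimal sketch that applies the regular expression quoted above to a few made-up URLs. The example.org URLs and the variable names are assumptions for illustration; only the pattern itself comes from the message.

```r
# Pattern quoted in the message above; per the thread, download.file() uses it
# on Windows when 'mode' is not supplied.
binary_ext <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"

# Hypothetical URLs, for illustration only.
urls <- c(
  "https://example.org/data.csv.gz",      # lower-case .gz at the very end
  "https://example.org/data.csv.GZ",      # upper-case extension
  "https://example.org/model.rdata",      # lower-case variant of RData
  "https://example.org/archive.zip?dl=1"  # extension followed by a query string
)

# TRUE where the heuristic would switch the default to mode = "wb".
matched <- grepl(binary_ext, urls)
print(data.frame(url = urls, treated_as_binary = matched))
```

Only the first URL matches; the other three would fall back to mode = "w" on Windows unless mode = "wb" is passed explicitly.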

On Thu, May 3, 2018 at 7:27 AM, Joris Meys <jorismeys at gmail.com> wrote:
> Thank you Henrik and Martin for explaining what was going on. Very
> insightful!
>
> On Thu, May 3, 2018 at 4:21 PM, Jeroen Ooms <jeroenooms at gmail.com> wrote:
>>
>> On Thu, May 3, 2018 at 2:42 PM, Henrik Bengtsson
>> <henrik.bengtsson at gmail.com> wrote:
>> > Use mode="wb" when you download the file. See
>> > https://github.com/HenrikBengtsson/Wishlist-for-R/issues/30.
>> >
>> > R core, and others, is there a good argument for why we are not making
>> > this the default download mode? It seems like such a simple fix to
>> > such a common "mistake".
>>
>> I'd like to second this feature request. This default behaviour is
>> unexpected and often leads to R scripts that were written on
>> mac/linux producing corrupted files on windows, checksum mismatches,
>> etc.
>>
>> Even for text files, the default should be to download the file as-is.
>> Trying to "fix" line-endings should be opt-in, never the default.
>> Downloading a file via a browser or ftp client on windows also doesn't
>> change the file, why should R?
>
>
> I third the feature request.
>
>>
>>
>>
>> On Thu, May 3, 2018 at 3:02 PM, Duncan Murdoch <murdoch.duncan at gmail.com>
>> wrote:
>> > Many downloads are text files (HTML, CSV, etc.), and if those are
>> > downloaded in binary, a Windows user might end up with a file that
>> > Notepad can't handle, because it would have Unix-style line endings.
>>
>> True but I don't think this is relevant. The same holds e.g. for the R
>> files in source packages, which also have unix line endings. Most
>> Windows users will use an actual editor that understands both types of
>> line endings, or can convert between the two.
>>
>> Downloading-file should do just that.
>
>
> Again, I agree. In my (limited) experience the only program that fails to
> properly display \n as a line ending, is Notepad. But it can still open the
> file regardless. If line ending conflicts cause bugs, it's almost always a
> unix-like OS struggling with Windows-style endings. I have yet to meet the
> first one the other way around.
>
> Cheers
> Joris
>
>
> --
> Joris Meys
> Statistical consultant
>
> Department of Data Analysis and Mathematical Modelling
> Ghent University
> Coupure Links 653, B-9000 Gent (Belgium)
>
> -----------
> Biowiskundedagen 2017-2018
> http://www.biowiskundedagen.ugent.be/
>
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
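
The advice quoted above (pass mode = "wb" explicitly, then check that the file arrived unchanged) can be sketched as follows. This is an illustration under assumptions: a local file:// URL stands in for a real remote URL so the example runs without network access, and the byte values are arbitrary.

```r
# Create a small "binary" file that contains LF bytes (0x0A).
src <- tempfile(fileext = ".bin")
writeBin(as.raw(c(0x1f, 0x8b, 0x08, 0x0a, 0x00, 0x0a, 0xff)), src)

# Build a file:// URL for it; this stands in for a real remote URL.
src_path <- normalizePath(src, winslash = "/")
prefix <- if (.Platform$OS.type == "windows") "file:///" else "file://"
src_url <- paste0(prefix, src_path)

# Always request a binary transfer explicitly instead of relying on the
# platform-specific default.
dest <- tempfile(fileext = ".bin")
status <- download.file(src_url, dest, mode = "wb", quiet = TRUE)

# Compare checksums of the source and the downloaded copy.
same <- identical(unname(tools::md5sum(src)), unname(tools::md5sum(dest)))
print(c(status = status, identical_checksum = same))
```

With a real download, the reference checksum would come from the publisher of the file rather than from a local copy.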

Tomas Kalibera

2018-May-04 06:34 UTC

On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
> Also, as mentioned in my
> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
> not specifying the mode argument, the default on Windows is mode = "w"
> *except* for certain, case-sensitive, filename extensions:
>
>      if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
>          mode <- "wb"
>
> Just like the need for mode = "wb" on Windows, the above
> special-file-extension-hack is only happening on Windows, and is only
> documented in ?download.file if you're on Windows; so someone who's on
> Linux/macOS trying to help someone on Windows may not be aware of
> this. This adds to even more confusions, e.g. "works for me".

If we were designing the API today, it would probably make more sense
not to convert any line endings by default. Today's editors _usually_ 
can cope with different line endings and it is probably easier to detect 
that a text file has incorrect line endings rather than detecting that a 
binary file has been corrupted by an attempt to convert line endings. 
But whether to change existing, documented behavior is a different 
question. In order to help users and programmers who do not read the 
documentation carefully we would create problems for users and 
programmers who do. The current heuristic/hack is in line with the 
compatibility approach: it detects files that are obviously binary, so 
it changes the default behavior only for cases when it would obviously 
cause damage.

Tomas
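
Why converting line endings damages a binary file can be shown with a simplified model. The sketch below assumes that a text-mode write on Windows turns every LF byte (0x0A) into CR LF (0x0D 0x0A); the helper simulate_text_mode() is hypothetical and is not how R or the C runtime implements this.

```r
# Simplified model of a Windows text-mode write: each LF byte becomes CR LF.
# 'simulate_text_mode' is a hypothetical helper for illustration only.
simulate_text_mode <- function(bytes) {
  lf <- as.raw(0x0a)
  crlf <- as.raw(c(0x0d, 0x0a))
  unlist(lapply(bytes, function(b) if (b == lf) crlf else b))
}

# Arbitrary binary payload that happens to contain two 0x0A bytes.
payload <- as.raw(c(0x1f, 0x8b, 0x08, 0x0a, 0x00, 0x0a))
converted <- simulate_text_mode(payload)

print(payload)
print(converted)
print(c(original = length(payload), converted = length(converted)))
```

The converted payload is two bytes longer, so any length field, offset, or checksum inside a compressed file no longer lines up with its contents.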

> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Martin Maechler

2018-May-04 07:06 UTC

>>>>> Tomas Kalibera <tomas.kalibera at gmail.com>
>>>>>     on Fri, 4 May 2018 08:34:03 +0200 writes:
    > On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
    >> Also, as mentioned in my
    >> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html,
    >> when not specifying the mode argument, the default on
    >> Windows is mode = "w" *except* for certain,
    >> case-sensitive, filename extensions:
    >> 
    >> if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
    >>      mode <- "wb"
    >> 
    >> Just like the need for mode = "wb" on Windows, the above
    >> special-file-extension-hack is only happening on Windows,
    >> and is only documented in ?download.file if you're on
    >> Windows; so someone who's on Linux/macOS trying to help
    >> someone on Windows may not be aware of this. This adds to
    >> even more confusions, e.g. "works for me".

    > If we were designing the API today, it would probably make
    > more sense not to convert any line endings by
    > default. Today's editors _usually_ can cope with different
    > line endings and it is probably easier to detect that a
    > text file has incorrect line endings rather than detecting
    > that a binary file has been corrupted by an attempt to
    > convert line endings.  But whether to change existing,
    > documented behavior is a different question. In order to
    > help users and programmers who do not read the
    > documentation carefully we would create problems for users
    > and programmers who do. 

    > The current heuristic/hack is in
    > line with the compatibility approach: it detects files
    > that are obviously binary, so it changes the default
    > behavior only for cases when it would obviously cause
    > damage.

    > Tomas


Thank you, Tomas;  I was about to say something similar but
probably less convincingly. 

There's one thing on which I strongly agree with Henrik:  the
Windows behavior that is currently documented only on Windows
should be documented on all platforms.

I'll update the help page,

and will also add the .rds extension to the above list
[ --- yes, we all should use saveRDS() and readRDS() whenever
      sensible in favor of save() and load() ]

Martin
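
As a small illustration of the saveRDS()/readRDS() remark, the sketch below round-trips one object through an .rds file and checks it against a pattern with rds added. The extended pattern is an assumption about how the list might look after the change, not the actual source of download.file().

```r
# Round-trip a single object through an .rds file.
x <- list(a = 1:3, b = "text")
path <- tempfile(fileext = ".rds")
saveRDS(x, path)
y <- readRDS(path)
print(identical(x, y))

# Assumed form of the extension list once 'rds' has been added.
extended <- "\\.(gz|bz2|xz|tgz|zip|rda|rds|RData)$"
has_binary_ext <- grepl(extended, path)
print(has_binary_ext)
```

Unlike load(), readRDS() returns the object instead of restoring it under its saved name, so the caller chooses the variable name.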



Joris Meys

2018-May-04 08:00 UTC

On Fri, May 4, 2018 at 8:34 AM, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> The current heuristic/hack is in line with the compatibility approach: it
> detects files that are obviously binary, so it changes the default behavior
> only for cases when it would obviously cause damage.
>
> Tomas

Well, I was trying to download a .gz file and download.file() didn't detect
that. The reason is that the link doesn't contain .gz but %2Egz, using the
percent-encoded form of the dot (its ASCII code in hex) instead of the dot
itself. That's common practice in a lot of links.

Hence I propose to change the line in download.file() that does this check
to:

  if (missing(mode) &&
      length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", URLdecode(url))))

using URLdecode() ensures that .gz, .RData etc will be detected correctly
in an encoded URL.

Cheers
Joris
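
The effect of the proposed URLdecode() call can be checked in isolation. The sketch below uses a made-up example.org URL; only the pattern and the use of URLdecode() come from the message above.

```r
# Pattern from the thread.
binary_ext <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"

# Hypothetical URL in which the dot before 'gz' is percent-encoded as %2E.
encoded_url <- "https://example.org/files/data%2Egz"
decoded_url <- utils::URLdecode(encoded_url)

# Match against the raw URL and against the decoded URL.
before <- grepl(binary_ext, encoded_url)
after <- grepl(binary_ext, decoded_url)

print(decoded_url)
print(c(before_decoding = before, after_decoding = after))
```

The pattern misses the encoded URL and matches the decoded one, which is the behaviour the proposal relies on.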

-- 
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

-----------
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Hadley Wickham

2018-May-08 15:15 UTC

On Thu, May 3, 2018 at 11:34 PM, Tomas Kalibera
<tomas.kalibera at gmail.com> wrote:
> On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
>>
>> Also, as mentioned in my
>> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
>> not specifying the mode argument, the default on Windows is mode = "w"
>> *except* for certain, case-sensitive, filename extensions:
>>
>>      if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
>>          mode <- "wb"
>>
>> Just like the need for mode = "wb" on Windows, the above
>> special-file-extension-hack is only happening on Windows, and is only
>> documented in ?download.file if you're on Windows; so someone who's on
>> Linux/macOS trying to help someone on Windows may not be aware of
>> this. This adds to even more confusions, e.g. "works for me".
>
> If we were designing the API today, it would probably make more sense not to
> convert any line endings by default. Today's editors _usually_ can cope with
> different line endings and it is probably easier to detect that a text file
> has incorrect line endings rather than detecting that a binary file has been
> corrupted by an attempt to convert line endings. But whether to change
> existing, documented behavior is a different question. In order to help
> users and programmers who do not read the documentation carefully we would
> create problems for users and programmers who do. The current heuristic/hack
> is in line with the compatibility approach: it detects files that are
> obviously binary, so it changes the default behavior only for cases when it
> would obviously cause damage.

From a purely utilitarian standpoint, there are far more users who do
not carefully read the documentation than users who do ;)

(I'd also argue that basing the decision on the file extension is
suboptimal, and it would be better to use the mime type if provided by
the server)

Hadley
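
The MIME-type idea can be sketched as a small decision function. Everything here is an assumption for illustration: mode_from_content_type() is a hypothetical helper, download.file() does not work this way, and treating only text/* types as text is one possible policy among several.

```r
# Hypothetical helper: choose a download mode from a Content-Type header value.
# Falls back to binary ("wb") when the server supplies no usable type.
mode_from_content_type <- function(content_type) {
  if (is.null(content_type) || is.na(content_type) || !nzchar(content_type)) {
    return("wb")
  }
  # Drop parameters such as "; charset=utf-8" and normalise case.
  type <- tolower(trimws(sub(";.*$", "", content_type)))
  if (startsWith(type, "text/")) "w" else "wb"
}

# Example header values, for illustration only.
examples <- c("application/gzip", "text/csv; charset=utf-8", "TEXT/HTML", "")
modes <- vapply(examples, mode_from_content_type, character(1), USE.NAMES = FALSE)
print(data.frame(content_type = examples, mode = modes))
```

In practice the header would have to be fetched from the server first, and a missing or generic type such as application/octet-stream is common, so an extension check could still be useful as a fallback.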

-- 
http://hadley.nz
