Henrik Bengtsson
2018-May-03 21:14 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
Also, as mentioned in my earlier post, https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when not specifying the mode argument, the default on Windows is mode = "w" *except* for certain, case-sensitive, filename extensions:

    if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
        mode <- "wb"

Just like the need for mode = "wb" on Windows, the above special-file-extension-hack is only happening on Windows, and is only documented in ?download.file if you're on Windows; so someone who's on Linux/macOS trying to help someone on Windows may not be aware of this. This adds even more confusion, e.g. "works for me".

/Henrik

On Thu, May 3, 2018 at 7:27 AM, Joris Meys <jorismeys at gmail.com> wrote:
> Thank you Henrik and Martin for explaining what was going on. Very
> insightful!
>
> On Thu, May 3, 2018 at 4:21 PM, Jeroen Ooms <jeroenooms at gmail.com> wrote:
>>
>> On Thu, May 3, 2018 at 2:42 PM, Henrik Bengtsson
>> <henrik.bengtsson at gmail.com> wrote:
>> > Use mode="wb" when you download the file. See
>> > https://github.com/HenrikBengtsson/Wishlist-for-R/issues/30.
>> >
>> > R core, and others, is there a good argument for why we are not
>> > making this the default download mode? It seems like such a simple
>> > fix to such a common "mistake".
>>
>> I'd like to second this feature request. This default behaviour is
>> unexpected and often leads to R scripts that were written on
>> mac/linux, to produce corrupted files on windows, checksum mismatches,
>> etc.
>>
>> Even for text files, the default should be to download the file as-is.
>> Trying to "fix" line-endings should be opt-in, never the default.
>> Downloading a file via a browser or ftp client on windows also doesn't
>> change the file, why should R?
>
> I third the feature request.
>
>> On Thu, May 3, 2018 at 3:02 PM, Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>> > Many downloads are text files (HTML, CSV, etc.), and if those are
>> > downloaded in binary, a Windows user might end up with a file that
>> > Notepad can't handle, because it would have Unix-style line endings.
>>
>> True but I don't think this is relevant. The same holds e.g. for the R
>> files in source packages, which also have unix line endings. Most
>> Windows users will use an actual editor that understands both types of
>> line endings, or can convert between the two.
>>
>> Downloading-file should do just that.
>
> Again, I agree. In my (limited) experience the only program that fails to
> properly display \n as a line ending, is Notepad. But it can still open the
> file regardless. If line ending conflicts cause bugs, it's almost always a
> unix-like OS struggling with Windows-style endings. I have yet to meet the
> first one the other way around.
>
> Cheers
> Joris
>
> --
> Joris Meys
> Statistical consultant
>
> Department of Data Analysis and Mathematical Modelling
> Ghent University
> Coupure Links 653, B-9000 Gent (Belgium)
>
> -----------
> Biowiskundedagen 2017-2018
> http://www.biowiskundedagen.ugent.be/
>
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
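The extension check quoted in Henrik's message can be tried in isolation. The following is a minimal sketch, not the actual download.file() source: guess_mode() is a hypothetical helper introduced only for illustration, and the example.org URLs are made up. It shows how the case-sensitive match leaves some binary files on the text-mode default.

```r
# Hypothetical helper (not part of R): mimics the Windows-only default
# described above. It returns "wb" only when the URL ends in one of the
# listed extensions, matched case-sensitively; otherwise it returns "w".
guess_mode <- function(url) {
  binary_ext <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"
  if (grepl(binary_ext, url)) "wb" else "w"
}

guess_mode("https://example.org/data.csv.gz")  # "wb"
guess_mode("https://example.org/data.csv.GZ")  # "w": upper-case extension is missed
guess_mode("https://example.org/model.rdata")  # "w": only "RData" is listed
```

Passing mode = "wb" explicitly, as in download.file(url, destfile, mode = "wb"), avoids relying on this guess at all.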
Tomas Kalibera
2018-May-04 06:34 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
> Also, as mentioned in my
> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
> not specifying the mode argument, the default on Windows is mode = "w"
> *except* for certain, case-sensitive, filename extensions:
>
>     if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
>         mode <- "wb"
>
> Just like the need for mode = "wb" on Windows, the above
> special-file-extension-hack is only happening on Windows, and is only
> documented in ?download.file if you're on Windows; so someone who's on
> Linux/macOS trying to help someone on Windows may not be aware of
> this. This adds to even more confusions, e.g. "works for me".

If we were designing the API today, it would probably make more sense not to convert any line endings by default. Today's editors _usually_ can cope with different line endings and it is probably easier to detect that a text file has incorrect line endings rather than detecting that a binary file has been corrupted by an attempt to convert line endings.

But whether to change existing, documented behavior is a different question. In order to help users and programmers who do not read the documentation carefully we would create problems for users and programmers who do. The current heuristic/hack is in line with the compatibility approach: it detects files that are obviously binary, so it changes the default behavior only for cases when it would obviously cause damage.

Tomas
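To make the corruption discussed above concrete, here is a small sketch that only simulates text-mode output on Windows, where each LF byte is written as CR LF. It does not call download.file(); simulate_text_mode() is a hypothetical helper and the payload bytes are made up for illustration.

```r
# Simulate text-mode output on Windows: every LF byte (0x0a) is written
# as CR LF (0x0d 0x0a). This is an illustration, not download.file() itself.
simulate_text_mode <- function(bytes) {
  ints <- as.integer(bytes)
  out <- unlist(lapply(ints, function(b) if (b == 10L) c(13L, 10L) else b))
  as.raw(out)
}

# Made-up payload: gzip magic bytes followed by data that contains 0x0a.
payload <- as.raw(c(0x1f, 0x8b, 0x08, 0x0a, 0x00, 0x0a, 0xff))
written <- simulate_text_mode(payload)

length(payload)              # 7
length(written)              # 9: two extra CR bytes were inserted
identical(payload, written)  # FALSE: the binary content has changed
```

The file still begins with the same magic bytes, so the damage is not visible from the name or the first bytes; it typically shows up only later, as a checksum mismatch or a decompression error.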
Martin Maechler
2018-May-04 07:06 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
>>>>> Tomas Kalibera <tomas.kalibera at gmail.com>
>>>>>     on Fri, 4 May 2018 08:34:03 +0200 writes:

> On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
>> Also, as mentioned in my
>> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html,
>> when not specifying the mode argument, the default on
>> Windows is mode = "w" *except* for certain,
>> case-sensitive, filename extensions:
>>
>>     if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
>>         mode <- "wb"
>>
>> Just like the need for mode = "wb" on Windows, the above
>> special-file-extension-hack is only happening on Windows,
>> and is only documented in ?download.file if you're on
>> Windows; so someone who's on Linux/macOS trying to help
>> someone on Windows may not be aware of this. This adds to
>> even more confusions, e.g. "works for me".

> If we were designing the API today, it would probably make
> more sense not to convert any line endings by
> default. Today's editors _usually_ can cope with different
> line endings and it is probably easier to detect that a
> text file has incorrect line endings rather than detecting
> that a binary file has been corrupted by an attempt to
> convert line endings. But whether to change existing,
> documented behavior is a different question. In order to
> help users and programmers who do not read the
> documentation carefully we would create problems for users
> and programmers who do.

> The current heuristic/hack is in
> line with the compatibility approach: it detects files
> that are obviously binary, so it changes the default
> behavior only for cases when it would obviously cause
> damage.

> Tomas

Thank you, Tomas; I was about to say something similar but probably less convincingly.

There's one thing on which I strongly agree with Henrik: The only-on-Windows documented Windows behavior should be documented on all platforms.

I'll update the help page, and will also add the .rds extension to the above list
[ --- yes, we all should use saveRDS() and readRDS() whenever sensible in favor of save() and load() ]

Martin
Joris Meys
2018-May-04 08:00 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
On Fri, May 4, 2018 at 8:34 AM, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> The current heuristic/hack is in line with the compatibility approach: it
> detects files that are obviously binary, so it changes the default behavior
> only for cases when it would obviously cause damage.
>
> Tomas

Well, I was trying to download a .gz file and download.file() didn't detect that. The reason is that the link doesn't contain .gz but %2Egz, using the percent-encoded ASCII code for the dot instead of the dot itself. That's general practice in a lot of links.

Hence I propose to change the line in download.file() that does this check to:

    if (missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", URLdecode(url))))

Using URLdecode() ensures that .gz, .RData etc. will be detected correctly in an encoded URL.

Cheers
Joris

--
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

-----------
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
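The effect of the proposed URLdecode() call can be checked without downloading anything. This is a small sketch using a made-up example.org URL; it only exercises the pattern match, not download.file() itself.

```r
# Same pattern as in the check quoted in this thread.
binary_ext <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"

# Made-up URL in which the dot before "gz" is percent-encoded as %2E.
url <- "https://example.org/files/data%2Egz"

grepl(binary_ext, url)                    # FALSE: the encoded dot is not matched
utils::URLdecode(url)                     # "https://example.org/files/data.gz"
grepl(binary_ext, utils::URLdecode(url))  # TRUE: the extension is now detected
```

Because the pattern is anchored at the end of the string, a URL where something follows the extension (for instance a query string) would presumably still be missed, with or without decoding.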
Hadley Wickham
2018-May-08 15:15 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
On Thu, May 3, 2018 at 11:34 PM, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> On 05/03/2018 11:14 PM, Henrik Bengtsson wrote:
>>
>> Also, as mentioned in my
>> https://stat.ethz.ch/pipermail/r-devel/2012-August/064739.html, when
>> not specifying the mode argument, the default on Windows is mode = "w"
>> *except* for certain, case-sensitive, filename extensions:
>>
>>     if(missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$", url)))
>>         mode <- "wb"
>>
>> Just like the need for mode = "wb" on Windows, the above
>> special-file-extension-hack is only happening on Windows, and is only
>> documented in ?download.file if you're on Windows; so someone who's on
>> Linux/macOS trying to help someone on Windows may not be aware of
>> this. This adds to even more confusions, e.g. "works for me".
>
> If we were designing the API today, it would probably make more sense not to
> convert any line endings by default. Today's editors _usually_ can cope with
> different line endings and it is probably easier to detect that a text file
> has incorrect line endings rather than detecting that a binary file has been
> corrupted by an attempt to convert line endings. But whether to change
> existing, documented behavior is a different question. In order to help
> users and programmers who do not read the documentation carefully we would
> create problems for users and programmers who do. The current heuristic/hack
> is in line with the compatibility approach: it detects files that are
> obviously binary, so it changes the default behavior only for cases when it
> would obviously cause damage.

From a purely utilitarian standpoint, there are far more users who do not carefully read the documentation than users who do ;)

(I'd also argue that basing the decision on the file extension is suboptimal, and it would be better to use the mime type if provided by the server)

Hadley

--
http://hadley.nz
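As a rough illustration of the MIME-type idea, the sketch below picks a write mode from a Content-Type header value. Nothing here is part of download.file(): mode_from_content_type() is a hypothetical helper, and the rule that only "text/*" types count as text is an assumption made for this example. Obtaining the header itself (for instance with base R's curlGetHeaders()) needs network access and is left out.

```r
# Hypothetical helper (not part of R): choose a write mode from a
# Content-Type header value. Treating only "text/*" as text is an
# assumption made for this sketch.
mode_from_content_type <- function(content_type) {
  # No usable header: fall back to binary, which leaves the bytes untouched.
  if (length(content_type) != 1L || is.na(content_type) || !nzchar(content_type)) {
    return("wb")
  }
  # Drop parameters such as "; charset=utf-8" and normalise case.
  media_type <- tolower(trimws(sub(";.*$", "", content_type)))
  if (startsWith(media_type, "text/")) "w" else "wb"
}

mode_from_content_type("text/csv; charset=utf-8")  # "w"
mode_from_content_type("application/gzip")         # "wb"
mode_from_content_type(NULL)                       # "wb"
```

Falling back to "wb" when the server sends no type follows the argument made earlier in the thread that an unconverted text file is easier to spot and repair than a corrupted binary one.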