Kevin Ushey
2015-Aug-25 20:30 UTC
[Rd] Issues with libcurl + HTTP status codes (eg. 403, 404)
Hi Martin,

Indeed it does (and I should have confirmed myself with R-patched and R-devel before posting...)

Thanks, and sorry for the noise.
Kevin

On Tue, Aug 25, 2015, 13:11 Martin Morgan <mtmorgan at fredhutch.org> wrote:

> On 08/25/2015 12:54 PM, Kevin Ushey wrote:
> > Hi all,
> >
> > The following fails for me (on OS X, although I imagine it's the same
> > on other platforms using libcurl):
> >
> >     options(download.file.method = "libcurl")
> >     options(repos = c(CRAN = "https://cran.rstudio.com/", CRANextra =
> >     "http://www.stats.ox.ac.uk/pub/RWin"))
> >     install.packages("lattice") ## could be any package
> >
> > gives me:
> >
> > > options(download.file.method = "libcurl")
> > > options(repos = c(CRAN = "https://cran.rstudio.com/", CRANextra
> > = "http://www.stats.ox.ac.uk/pub/RWin"))
> > > install.packages("lattice") ## could be any package
> > Installing package into ‘/Users/kevinushey/Library/R/3.2/library’
> > (as ‘lib’ is unspecified)
> > Error: Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!
> >
> > This seems to come from a call to `available.packages()` to a URL that
> > doesn't exist on the server (likely when querying PACKAGES on the
> > CRANextra repo)
> >
> > Eg.
> >
> > > URL <- "http://www.stats.ox.ac.uk/pub/RWin"
> > > available.packages(URL, method = "internal")
> > Warning: unable to access index for repository
> > http://www.stats.ox.ac.uk/pub/RWin
> > Package Version Priority Depends Imports LinkingTo Suggests
> > Enhances License License_is_FOSS
> > License_restricts_use OS_type Archs MD5sum NeedsCompilation
> > File Repository
> > > available.packages(URL, method = "libcurl")
> > Error: Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!
> >
> > It looks like libcurl downloads and retrieves the 403 page itself,
> > rather than reporting that it was actually forbidden, e.g.:
> >
> > > download.file("http://www.stats.ox.ac.uk/pub/RWin/bin/macosx/mavericks/contrib/3.2/PACKAGES.gz",
> > tempfile(), method = "libcurl")
> > trying URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/macosx/mavericks/contrib/3.2/PACKAGES.gz'
> > Content type 'text/html; charset=iso-8859-1' length 339 bytes
> > =================================================
> > downloaded 339 bytes
> >
> > Using `method = "internal"` gives an error related to the inability to
> > access that URL due to the HTTP status 403.
> >
> > The overarching issue here is that package installation shouldn't fail
> > even if libcurl fails to access one of the repositories set.
>
> With
>
> > R.version.string
> [1] "R version 3.2.2 Patched (2015-08-25 r69179)"
>
> the behavior is to warn with an indication of the repository for which the
> problem occurs
>
> > URL <- "http://www.stats.ox.ac.uk/pub/RWin"
> > available.packages(URL, method="libcurl")
> Warning: unable to access index for repository
> http://www.stats.ox.ac.uk/pub/RWin:
> Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!
> Package Version Priority Depends Imports LinkingTo Suggests Enhances
> License License_is_FOSS License_restricts_use OS_type Archs MD5sum
> NeedsCompilation File Repository
> > available.packages(URL, method="internal")
> Warning: unable to access index for repository
> http://www.stats.ox.ac.uk/pub/RWin:
> cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/PACKAGES'
> Package Version Priority Depends Imports LinkingTo Suggests Enhances
> License License_is_FOSS License_restricts_use OS_type Archs MD5sum
> NeedsCompilation File Repository
>
> Does that work for you / address the problem?
>
> Martin
>
> > > sessionInfo()
> > R version 3.2.2 (2015-08-14)
> > Platform: x86_64-apple-darwin13.4.0 (64-bit)
> > Running under: OS X 10.10.4 (Yosemite)
> >
> > locale:
> > [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] testthat_0.8.1.0.99 knitr_1.11 devtools_1.5.0.9001
> > [4] BiocInstaller_1.15.5
> >
> > loaded via a namespace (and not attached):
> > [1] httr_1.0.0 R6_2.0.0.9000 tools_3.2.2 parallel_3.2.2 whisker_0.3-2
> > [6] RCurl_1.95-4.1 memoise_0.2.1 stringr_0.6.2 digest_0.6.4 evaluate_0.7.2
> >
> > Thanks,
> > Kevin
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
Martin Morgan
2015-Aug-25 20:33 UTC
[Rd] Issues with libcurl + HTTP status codes (eg. 403, 404)
On 08/25/2015 01:30 PM, Kevin Ushey wrote:
> Hi Martin,
>
> Indeed it does (and I should have confirmed myself with R-patched and R-devel
> before posting...)

actually I don't know that it does -- it addresses the symptom but I think there should be an error from libcurl on the 403 / 404 rather than from read.dcf on error page...

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
Kevin Ushey
2015-Aug-25 21:41 UTC
[Rd] Issues with libcurl + HTTP status codes (eg. 403, 404)
In fact, this does reproduce on R-devel:

    > options(download.file.method = "libcurl")
    > options(repos = c(CRAN = "https://cran.rstudio.com/", CRANextra =
    + "http://www.stats.ox.ac.uk/pub/RWin"))
    > install.packages("lattice") ## could be any package
    Installing package into ‘/Users/kevinushey/Library/R/3.3/library’
    (as ‘lib’ is unspecified)
    Error: Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!
    > sessionInfo()
    R Under development (unstable) (2015-08-14 r69078)
    Platform: x86_64-apple-darwin13.4.0 (64-bit)
    Running under: OS X 10.10.4 (Yosemite)

I think this could be problematic for users with custom CRAN repositories. For example, if I have a CRAN repository that only serves source packages (no binary packages), this implies that any R session configured to download binary packages would fail to download any packages at all (as it would barf on attempting to read the non-existent PACKAGES file for the 'binary' branch of the custom repository).

This can also be seen by attempting to install a package using current R-devel (since no binaries are made available for R 3.3):

    > options(download.file.method = "libcurl")
    > options(repos = c(CRAN = "https://cran.rstudio.com/"))
    > print(getOption("pkgType"))
    [1] "both"
    > install.packages("lattice")
    Installing package into ‘/Users/kevinushey/Library/R/3.3/library’
    (as ‘lib’ is unspecified)
    Error in install.packages : Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!

The same error (with a different, XML response) is returned when using e.g. `https://cran.fhcrc.org`.

Kevin
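Until the download layer reports the HTTP status itself, one possible stopgap for the failure described above is to check whether a downloaded index looks like an HTML or XML error page before it reaches read.dcf(). The following is only a minimal sketch in base R; the helper names looks_like_html() and read_index_safely() are hypothetical and not part of R, and the check is a heuristic on the first non-blank line, not a substitute for a real status check.

```r
# Hypothetical helper: TRUE if the first non-blank line of the file at `path`
# starts like an HTML or XML document rather than a DCF record.
looks_like_html <- function(path) {
  lines <- trimws(readLines(path, n = 20L, warn = FALSE))
  lines <- lines[nzchar(lines)]
  if (length(lines) == 0L) return(FALSE)
  grepl("^<(!DOCTYPE|html|\\?xml)", lines[1L], ignore.case = TRUE)
}

# Hypothetical wrapper: warn and return NULL for a suspected error page,
# otherwise parse the file with read.dcf() as usual.
read_index_safely <- function(path) {
  if (looks_like_html(path)) {
    warning("index at '", path, "' looks like an HTML/XML error page; skipping",
            call. = FALSE)
    return(NULL)
  }
  read.dcf(path)
}
```

This only treats the symptom: a repository that returns an error page is skipped with a warning instead of aborting the whole installation.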
Jeroen Ooms
2015-Aug-26 22:04 UTC
[Rd] Issues with libcurl + HTTP status codes (eg. 403, 404)
On Tue, Aug 25, 2015 at 10:33 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:
> actually I don't know that it does -- it addresses the symptom but I think
> there should be an error from libcurl on the 403 / 404 rather than from
> read.dcf on error page...

Indeed, the only correct behavior is to turn the protocol error code into an R exception. When the server returns a status code >= 400, it indicates that the request was unsuccessful and that the response body does not contain the content the client requested, but should instead be interpreted as an error message/page. Ignoring this fact and proceeding to parse the body as usual is incorrect and leads to all kinds of strange errors downstream.

The other download methods did this correctly; it is unclear why the current implementation of the "libcurl" method does not. Not only does it lead to hard-to-interpret downstream parsing errors, it also makes the behavior of R ambiguous, as it depends on which download method is in use. It is certainly not a limitation of the libcurl library: the 'curl' package has alternative implementations of url() and download.file() which exercise the correct behavior.

I can only speculate, but if the motivation is to explicitly support retrieval of error pages, perhaps the download.file() and url() functions can gain an argument 'stop_on_error' or something similar which gives the user an option to ignore server errors. However, this behavior should certainly not be the default. When a function or script contains a line like this:

    download.file("https://someserver.com/mydata.csv", "mydata.csv")

then in the next line of code we must be able to expect that the file "mydata.csv" we have downloaded to our disk is in fact the file "mydata.csv" that was requested from the server. An implementation that instead saves an error page (likely HTML content) to the "mydata.csv" file is simply incorrect and will lead to obvious problems, even with a warning.
[1] https://www.opencpu.org/posts/cran-https/
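
The behavior argued for above can be sketched with the 'curl' package. curl::curl_fetch_memory() returns the response together with its status_code and does not itself raise an error for an HTTP error status, so the check has to be made explicitly. The names stop_for_http_status() and download_checked() are hypothetical, and the stop_on_error argument is borrowed from the suggestion above; none of them exists in base R. This is a rough illustration under those assumptions, not a proposed patch.

```r
# Hypothetical helper: signal an R error for HTTP status codes >= 400,
# otherwise return the status invisibly.
stop_for_http_status <- function(status, url) {
  if (status >= 400L) {
    stop(sprintf("cannot open URL '%s': HTTP status was '%d'", url, status),
         call. = FALSE)
  }
  invisible(status)
}

# Hypothetical wrapper; requires the 'curl' package to be installed.
# With stop_on_error = TRUE (the default) nothing is written to `destfile`
# unless the server reported success.
download_checked <- function(url, destfile, stop_on_error = TRUE) {
  res <- curl::curl_fetch_memory(url)
  if (stop_on_error) stop_for_http_status(res$status_code, url)
  writeBin(res$content, destfile)
  invisible(res$status_code)
}
```

With the default, a 403 or 404 surfaces as an R error at the download step, so a later read of destfile never sees an error page; passing stop_on_error = FALSE keeps the option of deliberately retrieving the error body.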