Stephen Berman
2019-May-04 17:04 UTC
[Rd] read.table() fails with https in R 3.6 but not in R 3.5
In versions of R prior to 3.6.0 the following invocation succeeds, returning the data frame shown:

> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text", header=TRUE)
   Dekade   Anzahl
1    1900 11467254
2    1910 13023370
3    1920 13434601
4    1930 13296355
5    1940 12121250
6    1950 13191131
7    1960 10587420
8    1970 10944129
9    1980 11279439
10   1990 12052652

But in version 3.6.0 it fails:

> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text", header=TRUE)
Error in file(file, "rt") :
  cannot open the connection to 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text'
In addition: Warning message:
In file(file, "rt") :
  cannot open URL 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text': HTTP status was '403 Forbidden'

The table at this URL is generated by a query processor and the same failure happens in 3.6.0 with other queries at this website. This website does not appear to serve data via http: replacing https by http in the above gives the same results, and in 3.6.0 the error message contains the URL with http but in the warning message the URL is with https.

I have also tried a few other websites that serve (non-generated) tabular data via https (e.g. https://graphchallenge.s3.amazonaws.com/synthetic/gc3/Theory-16-25-81-Bk.tsv) and with these read.table() succeeds in 3.6.0, so the problem isn't https in general. Maybe it has to do with the page being generated rather than static? There's only one reference to https in the 3.6.0 NEWS, concerning libcurl; I can't tell if it's relevant.

In case it matters, this is with R packaged for openSUSE, and I've found the above difference between 3.5 and 3.6 on both openSUSE Leap 15.0 and openSUSE Tumbleweed.

Steve Berman
Ralf Stubner
2019-May-06 09:12 UTC
[Rd] read.table() fails with https in R 3.6 but not in R 3.5
On 04.05.19 19:04, Stephen Berman wrote:
> In versions of R prior to 3.6.0 the following invocation succeeds,
> returning the data frame shown:
>
>> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text", header=TRUE)
>    Dekade   Anzahl
> 1    1900 11467254
> 2    1910 13023370
> 3    1920 13434601
> 4    1930 13296355
> 5    1940 12121250
> 6    1950 13191131
> 7    1960 10587420
> 8    1970 10944129
> 9    1980 11279439
> 10   1990 12052652
>
> But in version 3.6.0 it fails:
>
>> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text", header=TRUE)
> Error in file(file, "rt") :
>   cannot open the connection to 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text'
> In addition: Warning message:
> In file(file, "rt") :
>   cannot open URL 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text': HTTP status was '403 Forbidden'

I can reproduce the behavior on Debian using the CRAN supplied package for R 3.6.0. Trying to read the page with 'curl' produces also a 403 error plus some HTML text (in German) explaining that I am treated as a 'robot' due to the supplied User-Agent (here: curl/7.52.1). One suggested solution is to adjust that value which does solve the issue:

> options(HTTPUserAgent='mozilla')
> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text", header=TRUE)
   Dekade   Anzahl
1    1900 11467254
2    1910 13023370
3    1920 13434601
4    1930 13296355
5    1940 12121250
6    1950 13191131
7    1960 10587420
8    1970 10944129
9    1980 11279439
10   1990 12052652

Other solutions are to simulate a login or to get in touch with DWDS directly.

Greetings
Ralf
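The workaround above changes HTTPUserAgent for the rest of the session. A minimal sketch of a scoped variant follows, assuming only that the option should apply to a single call. The helper name with_user_agent is hypothetical and not part of base R; it relies only on options() and on.exit().

```r
# Hypothetical helper (not part of base R): evaluate `expr` while the
# HTTPUserAgent option is temporarily set to `agent`, then restore the
# previous value, even if `expr` signals an error.
with_user_agent <- function(agent, expr) {
  old <- options(HTTPUserAgent = agent)  # options() returns the old value(s)
  on.exit(options(old), add = TRUE)      # restore on exit
  force(expr)                            # `expr` is lazy, so it runs here
}

# Offline demonstration: the option is visible inside the call only.
inside <- with_user_agent("mozilla", getOption("HTTPUserAgent"))
print(inside)
```

With network access, the query from this thread could then be written as with_user_agent("mozilla", read.table(url, header = TRUE)), where url holds the DWDS address quoted above. Whether the server accepts a given agent string is up to the site.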
Stephen Berman
2019-May-06 12:27 UTC
[Rd] read.table() fails with https in R 3.6 but not in R 3.5
On Mon, 6 May 2019 11:12:25 +0200 Ralf Stubner <ralf.stubner at daqana.com> wrote:
> On 04.05.19 19:04, Stephen Berman wrote:
>> In versions of R prior to 3.6.0 the following invocation succeeds,
>> returning the data frame shown:
>>
>>> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text",
>>> header=TRUE)
>>    Dekade   Anzahl
>> 1    1900 11467254
>> 2    1910 13023370
>> 3    1920 13434601
>> 4    1930 13296355
>> 5    1940 12121250
>> 6    1950 13191131
>> 7    1960 10587420
>> 8    1970 10944129
>> 9    1980 11279439
>> 10   1990 12052652
>>
>> But in version 3.6.0 it fails:
>>
>>> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text",
>>> header=TRUE)
>> Error in file(file, "rt") :
>>   cannot open the connection to
>>   'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text'
>> In addition: Warning message:
>> In file(file, "rt") :
>>   cannot open URL
>>   'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text':
>>   HTTP status was '403 Forbidden'
>
> I can reproduce the behavior on Debian using the CRAN supplied package
> for R 3.6.0. Trying to read the page with 'curl' produces also a 403
> error plus some HTML text (in German) explaining that I am treated as a
> 'robot' due to the supplied User-Agent (here: curl/7.52.1). One
> suggested solution is to adjust that value which does solve the issue:
>
> > options(HTTPUserAgent='mozilla')

I confirm that works for me, too. Thanks!

FWIW, the default value of HTTPUserAgent in R 3.6 here is "R (3.6.0 x86_64-suse-linux-gnu x86_64 linux-gnu)", and using this (in R 3.6) fails as I reported, while the default value of HTTPUserAgent in R 3.5 here is "R (3.5.0 x86_64-suse-linux-gnu x86_64 linux-gnu)" and using that (in R 3.5) succeeds. However, setting HTTPUserAgent in R 3.5 to "libcurl/7.60.0" fails just as it does in 3.6.

It's not clear to me if this particular website is being too restrictive or if R 3.6 should deal with it, or at least mention the issue in NEWS or somewhere else.

Steve Berman
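To compare the configured User-Agent across installations, as done in the message above, the option can be inspected directly. The sketch below also rebuilds a string in the same "R (version platform arch os)" shape as the defaults quoted above. That layout is an assumption inferred from those reported values, and the function name default_user_agent is hypothetical. The string the server actually receives may still differ from this option, so this shows only what R is configured to send.

```r
# Hypothetical helper: rebuild a string in the same layout as the default
# values reported in this thread, "R (<version> <platform> <arch> <os>)".
# The layout is an assumption based on those reported values.
default_user_agent <- function() {
  sprintf("R (%s)",
          paste(getRversion(), R.version$platform,
                R.version$arch, R.version$os))
}

# The value currently configured for this session (may be NULL if unset).
current <- getOption("HTTPUserAgent")
print(current)

# The reconstructed default, for comparison with `current`.
print(default_user_agent())
```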