Stephen Berman
2019-May-06 12:27 UTC
[Rd] read.table() fails with https in R 3.6 but not in R 3.5
On Mon, 6 May 2019 11:12:25 +0200 Ralf Stubner <ralf.stubner at daqana.com> wrote:

> On 04.05.19 19:04, Stephen Berman wrote:
>> In versions of R prior to 3.6.0 the following invocation succeeds,
>> returning the data frame shown:
>>
>>> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text",
>>> header=TRUE)
>>    Dekade   Anzahl
>> 1    1900 11467254
>> 2    1910 13023370
>> 3    1920 13434601
>> 4    1930 13296355
>> 5    1940 12121250
>> 6    1950 13191131
>> 7    1960 10587420
>> 8    1970 10944129
>> 9    1980 11279439
>> 10   1990 12052652
>>
>> But in version 3.6.0 it fails:
>>
>>> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text",
>>> header=TRUE)
>> Error in file(file, "rt") :
>>   cannot open the connection to
>>   'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text'
>> In addition: Warning message:
>> In file(file, "rt") :
>>   cannot open URL
>>   'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text':
>>   HTTP status was '403 Forbidden'
>
> I can reproduce the behavior on Debian using the CRAN supplied package
> for R 3.6.0. Trying to read the page with 'curl' produces also a 403
> error plus some HTML text (in German) explaining that I am treated as a
> 'robot' due to the supplied User-Agent (here: curl/7.52.1). One
> suggested solution is to adjust that value which does solve the issue:
>
> > options(HTTPUserAgent='mozilla')

I confirm that works for me, too. Thanks! FWIW, the default value of HTTPUserAgent in R 3.6 here is "R (3.6.0 x86_64-suse-linux-gnu x86_64 linux-gnu)", and using this (in R 3.6) fails as I reported, while the default value of HTTPUserAgent in R 3.5 here is "R (3.5.0 x86_64-suse-linux-gnu x86_64 linux-gnu)" and using that (in R 3.5) succeeds. However, setting HTTPUserAgent in R 3.5 to "libcurl/7.60.0" fails just as it does in 3.6. It's not clear to me if this particular website is being too restrictive or if R 3.6 should deal with it, or at least mention the issue in NEWS or somewhere else.

Steve Berman
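The workaround confirmed above changes the user agent for the whole session. A minimal sketch of a narrower variant follows, assuming one only wants the override for a single read. The helper name `read_table_with_agent` is hypothetical and not part of R; the value "mozilla" is the one suggested in the thread.

```r
# Hypothetical helper (not part of R): read a table while temporarily
# overriding the HTTPUserAgent option, then restore the previous value.
read_table_with_agent <- function(file, agent = "mozilla", ...) {
  old <- options(HTTPUserAgent = agent)  # options() returns the old values
  on.exit(options(old), add = TRUE)      # restore them even if the read fails
  read.table(file, ...)
}

# Usage (needs network access; the URL is the one from the report):
# read_table_with_agent(
#   "https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text",
#   header = TRUE)
```

Restoring the option with `on.exit()` keeps the override from leaking into later requests made in the same session.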
Tomas Kalibera
2019-May-13 10:42 UTC
[Rd] read.table() fails with https in R 3.6 but not in R 3.5
On 5/6/19 2:27 PM, Stephen Berman wrote:
> On Mon, 6 May 2019 11:12:25 +0200 Ralf Stubner <ralf.stubner at daqana.com> wrote:
> [...]
>> One suggested solution is to adjust that value which does solve the issue:
>>
>> > options(HTTPUserAgent='mozilla')
>
> I confirm that works for me, too. Thanks! FWIW, the default value of
> HTTPUserAgent in R 3.6 here is "R (3.6.0 x86_64-suse-linux-gnu x86_64
> linux-gnu)", and using this (in R 3.6) fails as I reported, while the
> default value of HTTPUserAgent in R 3.5 here is "R (3.5.0
> x86_64-suse-linux-gnu x86_64 linux-gnu)" and using that (in R 3.5)
> succeeds. However, setting HTTPUserAgent in R 3.5 to "libcurl/7.60.0"
> fails just as it does in 3.6. It's not clear to me if this particular
> website is being too restrictive or if R 3.6 should deal with it, or at
> least mention the issue in NEWS or somewhere else.

This is because (from NEWS:)

    The default "user agent" has been changed when accessing http://
    and https:// sites using libcurl. (A site was found which caused
    libcurl to infinite-loop with the previous default.)

This website is ok with the default R user agent specification (also for R 3.6 and R-devel), but it is not ok with "libcurl/...". Setting the user agent to anything starting with "R (" will not help in R 3.6, because it will get automatically changed to "libcurl/..." when libcurl is used (note using wget and curl on the command line fails on this website).

I am afraid it has to be solved on the user side (e.g. as hinted in that German text one gets when requesting the page using curl) - R should not attempt to circumvent access restrictions on external websites.

Best
Tomas
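The substitution described in this message can be modelled in a few lines of R. This is only an illustrative sketch of the behaviour as stated here, not R's actual C implementation; the function name `effective_user_agent` and its `curl_version` parameter are hypothetical.

```r
# Illustrative model only (hypothetical function, not R's internal code):
# which User-Agent string would be sent in R 3.6 when libcurl is used,
# according to the rule described in the message above.
effective_user_agent <- function(agent = getOption("HTTPUserAgent"),
                                 curl_version = as.vector(libcurlVersion())) {
  stopifnot(is.character(agent), length(agent) == 1)
  if (startsWith(agent, "R (")) {
    # Anything starting with "R (" is replaced by the libcurl string
    paste0("libcurl/", curl_version)
  } else {
    # Any other value is passed through unchanged
    agent
  }
}

effective_user_agent("R (3.6.0 x86_64-suse-linux-gnu x86_64 linux-gnu)", "7.60.0")
effective_user_agent("mozilla", "7.60.0")
```

Under this model the first call yields a "libcurl/..." string, which the site rejects, while the second passes "mozilla" through, consistent with the workaround reported earlier in the thread.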
Gábor Csárdi
2019-May-13 10:54 UTC
[Rd] read.table() fails with https in R 3.6 but not in R 3.5
Hi Tomas,

On Mon, May 13, 2019 at 11:42 AM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
[...]
> This is because (from NEWS:)
>
> The default "user agent" has been changed when accessing http://
> and https:// sites using libcurl. (A site was found which caused
> libcurl to infinite-loop with the previous default.)

Which site was this? Maybe it can be fixed on their end?

The current behavior is not really ideal, because the `libcurl/x.y.z` string is not only a default, but as you mention above, anything that starts with `R (` is replaced with it, so it is basically impossible to send out a UserAgent that starts with `R (`. This was very surprising to me, and I had to go to the C source code to see why R does not respect my `HTTPUserAgent` option. Would it make sense to document this in `?options`?

Actually, the default that includes R's version number seems more sensible to me. Maybe we can just add `libcurl/x.y.z` to that to work around that buggy site? I would be happy to test this and send a patch, if you could let me know which website it was.

Thanks!
Gabor

[...]
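For concreteness, the combined string proposed above might look like the following. This is a sketch under assumptions: the thread only says to "add `libcurl/x.y.z`" to the default, so the separator and ordering used here are guesses, and `proposed_user_agent` is a hypothetical name. The "R (...)" part follows the format of the default values quoted earlier in the thread.

```r
# Hypothetical sketch of the proposal: keep the "R (...)" default and
# append the libcurl version. The exact combined format is an assumption.
proposed_user_agent <- function(curl_version = as.vector(libcurlVersion())) {
  r_part <- sprintf("R (%s)",
                    paste(getRversion(), R.version$platform,
                          R.version$arch, R.version$os))
  paste0(r_part, " libcurl/", curl_version)
}

proposed_user_agent("7.60.0")
```

Whether such a string would avoid the infinite loop mentioned in NEWS cannot be checked without knowing the affected site, which is the open question in this message.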