clair.crossupton at googlemail.com
2009-Jan-26 13:58 UTC
[R] RCurl unable to download a particular web page -- what is so special about this web page?
Dear R-help,

There seems to be a web page I am unable to download using RCurl. I
don't understand why it won't download:

> library(RCurl)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> getURL(my.url)
[1] ""

Other web pages download fine, but this is the first time I have been
unable to download a web page using the very nice RCurl package. While
I can download the web page using RDCOMClient, I would like to
understand why it does not work as above, please?

> library(RDCOMClient)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> ie <- COMCreate("InternetExplorer.Application")
> txt <- list()
> ie$Navigate(my.url)
NULL
> while(ie[["Busy"]]) Sys.sleep(1)
> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
> txt
$`http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2`
[1] "Skip to article Try Electronic Edition Log ...

Many thanks for your time,
C.C

Windows Vista, running with administrator privileges.

> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] RDCOMClient_0.92-0 RCurl_0.94-0

loaded via a namespace (and not attached):
[1] tools_2.8.1
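P.S. In case it helps with diagnosis: I have not tried this yet, but a
sketch using RCurl's basicHeaderGatherer() should at least show the
status line the server sends back, even when the body comes back empty:

library(RCurl)

my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

## collect the response headers via a header callback; the (empty) body is ignored here
h <- basicHeaderGatherer()
invisible(getURL(my.url, headerfunction = h$update))
h$value()[c("status", "statusMessage")]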
Tony Breyal
2009-Jan-26 15:15 UTC
[R] RCurl unable to download a particular web page -- what is so special about this web page?
Hi,

I ran your getURL example and had the same problem downloading the
page:

## R start
> library(RCurl)
> toString(getURL("http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"))
[1] ""
## R end

However, it is interesting that if you manually save the page to your
desktop, getURL works fine on the local copy:

## R start
> library(RCurl)
> toString(getURL('file:////PFO-SBS001//Redirected//tonyb//Desktop//webpage.html'))
[1] "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\"> \n<html>\n<head>\n\ [etc...]
## R end

Very strange indeed. I use RCurl for web crawling every now and again,
so I would be interested in knowing why this happens too :-)

Tony Breyal

On 26 Jan, 13:58, "clair.crossup... at googlemail.com"
<clair.crossup... at googlemail.com> wrote:
> Dear R-help,
>
> There seems to be a web page I am unable to download using RCurl. I
> don't understand why it won't download:
>
> > library(RCurl)
> > my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
> > getURL(my.url)
> [1] ""
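P.S. A quick way to compare the remote and local cases might be to ask
the server for just the headers (a sketch only; 'nobody' and 'header'
are standard libcurl options passed straight through getURL, and I have
not dug any deeper than this):

## R start
library(RCurl)

remote <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

## HEAD-style request: suppress the body, but print whatever headers the server returns
cat(getURL(remote, nobody = TRUE, header = TRUE))
## R end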
Duncan Temple Lang
2009-Jan-26 16:12 UTC
[R] RCurl unable to download a particular web page -- what is so special about this web page?
clair.crossupton at googlemail.com wrote:
> Dear R-help,
>
> There seems to be a web page I am unable to download using RCurl. I
> don't understand why it won't download:
>
>> library(RCurl)
>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
>> getURL(my.url)
> [1] ""

I like the irony that RCurl seems to have difficulties downloading an
article about R. Good thing it is just a matter of additional arguments
to getURL(), or it would be bad news.

The followlocation parameter defaults to FALSE, so

  getURL(my.url, followlocation = TRUE)

gets what you want.

The way I found this was to call

  getURL(my.url, verbose = TRUE)

and look at the information sent from R and received by R from the
server. This gives

* About to connect() to www.nytimes.com port 80 (#0)
*   Trying 199.239.136.200... * connected
* Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
> GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
Host: www.nytimes.com
Accept: */*

< HTTP/1.1 301 Moved Permanently
< Server: Sun-ONE-Web-Server/6.1
< Date: Mon, 26 Jan 2009 16:10:51 GMT
< Content-length: 0
< Content-type: text/html
< Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html&OQ=_rQ3D3&OP=42fceb38Q2FQ5DuaRQ5D3-z8Q26--Q24JQ5DJCCQ7BQ5DCMQ5DC1Q5DQ24azf@-F-Q2ANQ5DRY8h@a88Q3Dz-dbYQ24h@Q2AQ5DC1bQ26-Q2AQ26Q5BdDfQ24dF
<

And the 301 is the critical thing here.

 D.
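P.S. For completeness, the whole thing in one runnable snippet.
followlocation is the only option that actually matters here; maxredirs
is just an optional safety cap on how many redirects libcurl will
chase, so treat that part as a suggestion rather than a requirement:

library(RCurl)

my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

## follow the 301 (and any further redirects) instead of stopping at the empty reply
txt <- getURL(my.url, followlocation = TRUE, maxredirs = 10L)
nchar(txt)   # non-zero once the redirect chain has been followed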