clair.crossupton at googlemail.com
2009-Jan-26 13:58 UTC
[R] RCurl unable to download a particular web page -- what is so special about this web page?
Dear R-help,

There seems to be a web page I am unable to download using RCurl. I
don't understand why it won't download:

> library(RCurl)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> getURL(my.url)
[1] ""

Other web pages download fine, but this is the first time I have been
unable to download a web page using the very nice RCurl package. While
I can download the web page using RDCOMClient, I would like to
understand why it does not work as above, please?

> library(RDCOMClient)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> ie <- COMCreate("InternetExplorer.Application")
> txt <- list()
> ie$Navigate(my.url)
NULL
> while(ie[["Busy"]]) Sys.sleep(1)
> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
> txt
$`http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2`
[1] "Skip to article Try Electronic Edition Log ...

Many thanks for your time,
C.C

Windows Vista, running with administrator privileges.

> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] RDCOMClient_0.92-0 RCurl_0.94-0

loaded via a namespace (and not attached):
[1] tools_2.8.1
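P.S. In case it helps with diagnosis: I have not tried this yet, but a
sketch using RCurl's basicHeaderGatherer() should at least show the
status line the server sends back, even when the body comes back empty:

library(RCurl)

my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

## collect the response headers via a header callback; the (empty) body is ignored here
h <- basicHeaderGatherer()
invisible(getURL(my.url, headerfunction = h$update))
h$value()[c("status", "statusMessage")]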
Tony Breyal
2009-Jan-26 15:15 UTC
[R] RCurl unable to download a particular web page -- what is so special about this web page?
Hi,

I ran your getURL example and had the same problem downloading the
page:

## R start
> library(RCurl)
> toString(getURL("http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"))
[1] ""
## R end

However, it is interesting that if you manually save the page to your
desktop, getURL works fine on the local copy:

## R start
> library(RCurl)
> toString(getURL('file:////PFO-SBS001//Redirected//tonyb//Desktop//webpage.html'))
[1] "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\"> \n<html>\n<head>\n\ [etc...]
## R end

Very strange indeed. I use RCurl for web crawling every now and again,
so I would be interested in knowing why this happens too :-)

Tony Breyal

On 26 Jan, 13:58, "clair.crossup... at googlemail.com"
<clair.crossup... at googlemail.com> wrote:
> Dear R-help,
>
> There seems to be a web page I am unable to download using RCurl. I
> don't understand why it won't download:
>
> > library(RCurl)
> > my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
> > getURL(my.url)
> [1] ""
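P.S. A quick way to compare the remote and local cases might be to ask
the server for just the headers (a sketch only; 'nobody' and 'header'
are standard libcurl options passed straight through getURL, and I have
not dug any deeper than this):

## R start
library(RCurl)

remote <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

## HEAD-style request: suppress the body, but print whatever headers the server returns
cat(getURL(remote, nobody = TRUE, header = TRUE))
## R end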
Duncan Temple Lang
2009-Jan-26 16:12 UTC
[R] RCurl unable to download a particular web page -- what is so special about this web page?
clair.crossupton at googlemail.com wrote:
> Dear R-help,
>
> There seems to be a web page I am unable to download using RCurl. I
> don't understand why it won't download:
>
>> library(RCurl)
>> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
>> getURL(my.url)
> [1] ""

I like the irony that RCurl seems to have difficulties downloading an
article about R. Good thing it is just a matter of additional arguments
to getURL(), or it would be bad news.

The followlocation parameter defaults to FALSE, so

  getURL(my.url, followlocation = TRUE)

gets what you want.

The way I found this was to call

  getURL(my.url, verbose = TRUE)

and look at the information sent from R and received by R from the
server. This gives

* About to connect() to www.nytimes.com port 80 (#0)
*   Trying 199.239.136.200... * connected
* Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
> GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
Host: www.nytimes.com
Accept: */*

< HTTP/1.1 301 Moved Permanently
< Server: Sun-ONE-Web-Server/6.1
< Date: Mon, 26 Jan 2009 16:10:51 GMT
< Content-length: 0
< Content-type: text/html
< Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html&OQ=_rQ3D3&OP=42fceb38Q2FQ5DuaRQ5D3-z8Q26--Q24JQ5DJCCQ7BQ5DCMQ5DC1Q5DQ24azf@-F-Q2ANQ5DRY8h@a88Q3Dz-dbYQ24h@Q2AQ5DC1bQ26-Q2AQ26Q5BdDfQ24dF
<

And the 301 is the critical thing here.

 D.
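P.S. For completeness, the whole thing in one runnable snippet.
followlocation is the only option that actually matters here; maxredirs
is just an optional safety cap on how many redirects libcurl will
chase, so treat that part as a suggestion rather than a requirement:

library(RCurl)

my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

## follow the 301 (and any further redirects) instead of stopping at the empty reply
txt <- getURL(my.url, followlocation = TRUE, maxredirs = 10L)
nchar(txt)   # non-zero once the redirect chain has been followed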