Hi,
I want to mine web pages and decided to use tm and scrapeR. The example
given in scrapeR's manual runs as follows:
library(scrapeR)
pageSource <- scrape(url="http://cran.r-project.org/web/packages/",
                     headers=TRUE, parse=FALSE)
if(attributes(pageSource)$headers["status"]==200) {
    page <- scrape(object="pageSource")
    xpathSApply(page,"//table//td/a",xmlValue)
} else {
    cat("There was an error with the page. \n")
}
Running it returns a list and an error. str(pageSource) gives:
List of 1
 $ http://cran.r-project.org/web/packages/: atomic [1:1] <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml">
   ## I have left out most of the HTML that was returned.
 ..- attr(*, "headers")= Named chr "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns="| __truncated__
 .. ..- attr(*, "names")= chr "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns="| __truncated__
I seem to be missing the status entry: the "headers" attribute of the
returned list holds the page source rather than the HTTP headers. Also,
scrape(object="pageSource") returns a list, which gives xpathSApply
indigestion!
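For the second problem, indexing into the list seems to get past the
xpathSApply error, at least as a workaround (a minimal sketch, assuming
the parse itself succeeds; it does nothing about the missing status):

page <- scrape(object="pageSource")   # returns a list of parsed documents
doc  <- page[[1]]                     # take the first (and only) document
xpathSApply(doc, "//table//td/a", xmlValue)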
I am running R 2.15.3 (2013-03-01) on Ubuntu 12.04 with RCurl 1.95-4.1
and libcurl4-gnutls-dev (version 7.22.0-3ubuntu4.1) and libcurl3
(version 7.22.0-3ubuntu4.1). RCurl's basicHeaderGatherer() function
returns a status of 200 for
http://cran.r-project.org/web/packages/index.html
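For reference, that check was roughly the following (a minimal sketch
using RCurl's standard headerfunction hook):

library(RCurl)
h <- basicHeaderGatherer()
getURL("http://cran.r-project.org/web/packages/index.html",
       headerfunction=h$update)   # headers go to h, body is returned
h$value()["status"]               # gives "200"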
I assume I have a problem with my libcurl setup... Any pointers to
fixing this?
Andrew
Andrew Roberts
Oswestry UK