Hi,
I want to mine web pages and decided to use tm and scrapeR. The example
given in scrapeR's manual runs as follows:
library(scrapeR)
pageSource <- scrape(url="http://cran.r-project.org/web/packages/",
                     headers=TRUE, parse=FALSE)
if(attributes(pageSource)$headers["status"]==200) {
    page <- scrape(object="pageSource")
    xpathSApply(page,"//table//td/a",xmlValue)
} else {
    cat("There was an error with the page. \n")
}
Running it returns a list and an error. str(pageSource) gives:
List of 1
 $ http://cran.r-project.org/web/packages/: atomic [1:1] <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml">
   ## I have left out most of the HTML that was returned.
 ..- attr(*, "headers")= Named chr "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns="| __truncated__
 .. ..- attr(*, "names")= chr "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns="| __truncated__
I seem to be missing the status entry: the "headers" attribute of the
returned list holds the page source rather than the HTTP headers. Also,
scrape(object="pageSource") returns a list, which gives xpathSApply
indigestion!
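For the second problem, indexing into the list seems to get past the
xpathSApply error, at least as a workaround (a minimal sketch, assuming
the parse itself succeeds; it does nothing about the missing status):

page <- scrape(object="pageSource")   # returns a list of parsed documents
doc  <- page[[1]]                     # take the first (and only) document
xpathSApply(doc, "//table//td/a", xmlValue)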
I am running R 2.15.3 (2013-03-01) on Ubuntu 12.04 with RCurl 1.95-4.1
and libcurl4-gnutls-dev (version 7.22.0-3ubuntu4.1) and libcurl3
(version 7.22.0-3ubuntu4.1). RCurl's basicHeaderGatherer() function
returns a status of 200 for
http://cran.r-project.org/web/packages/index.html
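For reference, that check was roughly the following (a minimal sketch
using RCurl's standard headerfunction hook):

library(RCurl)
h <- basicHeaderGatherer()
getURL("http://cran.r-project.org/web/packages/index.html",
       headerfunction=h$update)   # headers go to h, body is returned
h$value()["status"]               # gives "200"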
I assume I have a problem with my libcurl setup... Any pointers to
fixing this?
Andrew
Andrew Roberts
Oswestry UK