Hi,
I want to mine web pages and have decided to use the tm and scrapeR 
packages. The example given in scrapeR's manual runs as follows:
library(scrapeR)

pageSource <- scrape(url="http://cran.r-project.org/web/packages/",
                     headers=TRUE, parse=FALSE)
if(attributes(pageSource)$headers["status"]==200) {
  page <- scrape(object="pageSource")
  xpathSApply(page, "//table//td/a", xmlValue)
} else {
  cat("There was an error with the page. \n")
}
Running it returns a list and then an error. str(pageSource) gives:
List of 1
 $ http://cran.r-project.org/web/packages/: atomic [1:1] <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
.
## I have left out most of the HTML that was returned.
.
  ..- attr(*, "headers")= Named chr "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns="| __truncated__
  .. ..- attr(*, "names")= chr "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns="| __truncated__
So the status element seems to be missing from the headers attribute 
(which holds the raw page source rather than the parsed HTTP headers), 
and scrape(object="pageSource") returns a list, which gives xpathSApply 
indigestion! Pulling the parsed document out of the list should at 
least deal with the second problem; see the sketch below.
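A minimal sketch of that workaround, on the assumption that scrape() 
always returns a list of parsed documents:

page <- scrape(object="pageSource")  # a list of parsed documents
doc  <- page[[1]]                    # extract the first document
xpathSApply(doc, "//table//td/a", xmlValue)

That still leaves the missing status attribute, which is the real puzzle.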
I am running R 2.15.3 (2013-03-01) on Ubuntu 12.04 with RCurl 1.95-4.1, 
libcurl4-gnutls-dev 7.22.0-3ubuntu4.1, and libcurl3 7.22.0-3ubuntu4.1. 
RCurl's basicHeaderGatherer() function returns a status of 200 for 
http://cran.r-project.org/web/packages/index.html
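For reference, this is roughly the check I ran (basicHeaderGatherer() 
and getURL() are both from RCurl):

library(RCurl)

h <- basicHeaderGatherer()
txt <- getURL("http://cran.r-project.org/web/packages/index.html",
              headerfunction = h$update)
h$value()["status"]
## status
##  "200"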
I assume I have a problem with my libcurl setup. Any pointers to 
fixing this?
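In the meantime, parsing the page directly with the XML package (which 
scrapeR loads anyway) looks like a possible stopgap, though it 
sidesteps rather than explains the header problem:

library(XML)

# Let libxml2 fetch and parse the page, bypassing scrape() entirely
page <- htmlParse("http://cran.r-project.org/web/packages/")
head(xpathSApply(page, "//table//td/a", xmlValue))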
Andrew
Andrew Roberts
Oswestry UK