Hello!
I am not experienced enough to know whether I have found a bug or
whether I am just ignorant.
I have been trying to use the tm package to read in material from RSS
2.0 feeds, which has required grappling with writing a reader for that
flavour of XML. I get an error, "Error : 1: EntityRef: expecting ';'",
which I think I've tracked down.
The feed being processed is from Wordpress:
http://scottbw.wordpress.com/feed/
Note that it contains a number of entity references in various places.
The trouble-makers seem to be the &amp; references that stand for the
"&" in a URL query string:
<media:content
  url="http://0.gravatar.com/avatar/a1033a3e5956f5db65e0cc20f5ea167f?s=96&amp;d=identicon&amp;r=G"
  medium="image">
AFAIK, this is a correct encoding.
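A quick way to check whether a line of raw XML contains an "&" that is not
part of an entity or character reference might look like the sketch below.
The helper name has_bare_ampersand, the regular expression and the
example.org strings are my own assumptions for illustration; nothing here
comes from tm or the XML package.

```r
# Hypothetical helper: TRUE for each string containing an "&" that does not
# start an entity or character reference such as &amp;, &#38; or &#x26;.
has_bare_ampersand <- function(x) {
  grepl("&(?!(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_][A-Za-z0-9._-]*);)",
        x, perl = TRUE)
}

# Made-up attribute strings: one correctly escaped, one with bare ampersands.
escaped <- 'url="http://example.org/avatar?s=96&amp;d=identicon&amp;r=G"'
bare    <- 'url="http://example.org/avatar?s=96&d=identicon&r=G"'

print(has_bare_ampersand(c(escaped, bare)))
# [1] FALSE  TRUE
```

Applied to the lines returned by readLines() on the feed, which() around the
result would give the positions of any offending lines.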
Parsing this with the following two lines and then inspecting "t"
shows that the &amp; references have been translated to "&", while
other entity refs have not:
a <- readLines(url(as.character(feeds[2, 2])))
t <- XML::xmlTreeParse(a, replaceEntities = FALSE, asText = TRUE)
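To narrow this down without the full feed, a small self-contained test along
these lines might help. It is only a sketch: the helper
url_attr_after_parse, the cut-down fragment and the example.org URL are made
up for illustration, and it assumes the XML package is installed. It prints
the attribute value as seen after parsing, once with replaceEntities = FALSE
and once with TRUE, so the two can be compared.

```r
# Hypothetical helper: parse a small XML string and return the "url" attribute
# of the root's first child, so the effect of replaceEntities can be compared.
url_attr_after_parse <- function(xml_text, replace_entities) {
  doc <- XML::xmlTreeParse(xml_text, asText = TRUE,
                           replaceEntities = replace_entities)
  XML::xmlGetAttr(XML::xmlRoot(doc)[[1]], "url")
}

# Made-up fragment shaped like the feed's media:content element
# (namespace prefix dropped so the fragment is self-contained).
snippet <- paste0(
  '<item><content url="http://example.org/avatar?s=96&amp;d=identicon" ',
  'medium="image"/></item>')

# Only run the comparison when the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  print(url_attr_after_parse(snippet, replace_entities = FALSE))
  print(url_attr_after_parse(snippet, replace_entities = TRUE))
}
```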
I'm guessing this is what breaks things when I try to do things with tm:
library(tm)

# Reader mapping RSS 2.0 <item> children onto tm document metadata
rss2Reader <- readXML(
  spec = list(
    Author = list("node", "/item/creator"),
    Content = list("node", "/item/description"),
    DateTimeStamp = list("function",
                         function(x) as.POSIXlt(Sys.time(), tz = "GMT")),
    Heading = list("node", "/item/title"),
    ID = list("function", function(x) tempfile()),
    Origin = list("node", "/item/link")),
  doc = PlainTextDocument())

# Source that splits the feed into one element per <item>
rss2Source <- function(x, encoding = "UTF-8")
  XMLSource(x,
            function(tree)
              XML::getNodeSet(XML::xmlRoot(tree), "/rss/channel/item"),
            rss2Reader, encoding)

feed.rss2 <- rss2Source(url("http://scottbw.wordpress.com/feed/"))
corp1 <- Corpus(feed.rss2, readerControl = list(language = "en"))
I've googled around for this problem but got nowhere. Have I missed
something?
Any help will be received gratefully; this was supposed to be the easy
part!
Cheers, Adam