Simon Kiss
2012-May-11 17:14 UTC
[R] Using xpathapply or getnodeset to get text between two distinct tags
Hello:
The code below extracts the links to the daily transcripts of Canada's
House of Commons. 'links' is a one-column matrix of URLs, each of which
points to one day's transcript.
If you inspect the HTML returned by scrape(links[1]), you will find that
periodically an italic tag appears inside a paragraph tag
(<p some text><i>Translation</i></p>). At this point, the speaker is
speaking French.
Then there are some <div> tags that contain the speech, and then, after the
speaker has returned to English, you get the same pattern as above:
<p some text><i>English</i></p><div>Some speech</div><div>Some speech</div>
Ultimately, what I'd like to do is count the words between the <i> tags
'Translation' and 'English'.
I'm pretty sure I can get the text into the tm package to do the word
counts. What I really don't know how to do is return the text between
'Translation' and 'English' so that I can mark it as
'French', and then return the text between 'English' and
'Translation' and mark it as 'English'.
Does anyone have any suggestions? Yours truly,
Simon J. Kiss
#Necessary libraries
library(XML)
library(scrapeR)
#URL for links to 2012 transcripts
hansard<-c('http://www.parl.gc.ca/housechamberbusiness/ChamberSittings.aspx?View=H&Language=E&Mode=1&Parl=41&Ses=1')
#Scrape the page with the links
doc<-scrape(url=hansard, parse=TRUE, follow=TRUE)
#scrape() returns a list of parsed documents, one per URL; keep the first
doc<-doc[[1]]
#Get the root node of the parsed document
doc<- xmlRoot(doc)
#Get nodes that contain only the links to each day's transcripts
links<- getNodeSet(doc,
"//a[@class='PublicationCalendarLink']/@href")
links<-matrix(links)
#Paste those href links to the root URL
links<-apply(links, 1, function(x) paste('http://www.parl.gc.ca', x,
sep=''))
#Inspect
links[1]
#Scrape text from first URL in 'links'
oneday<-scrape(links[1])[[1]]
#Return the <i> elements nested in <p> elements from 'oneday'
getNodeSet(oneday, "//p/i")
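#To make the goal concrete, here is a minimal sketch on a toy document that
#mimics the structure described above. The toy HTML, the variable names, and
#the assumption that speech before the first marker is English are all
#made up for illustration; the real pages may nest their <div> tags
#differently, so this is only one possible approach, not a tested solution.

```r
library(XML)

# Toy HTML imitating the pattern in the transcripts (invented content)
html <- '<html><body>
<p class="x"><i>Translation</i></p>
<div>Bonjour monsieur le president</div>
<div>Merci beaucoup</div>
<p class="x"><i>English</i></p>
<div>Thank you Mr. Speaker</div>
<p class="x"><i>Translation</i></p>
<div>Encore une fois</div>
</body></html>'
toy <- htmlParse(html, asText = TRUE)

# Marker paragraphs and speech <div>s, selected in one pass in document order
nodes <- getNodeSet(toy,
  "//*[self::div or self::p[i='Translation' or i='English']]")

is_marker <- sapply(nodes, xmlName) == "p"
text <- sapply(nodes, xmlValue)
text <- gsub("^[[:space:]]+|[[:space:]]+$", "", text)

# Walk through the nodes, carrying the current language forward
lang <- rep(NA_character_, length(nodes))
current <- "English"  # assumption: speech before the first marker is English
for (k in seq_along(nodes)) {
  if (is_marker[k]) {
    current <- if (text[k] == "Translation") "French" else "English"
  } else {
    lang[k] <- current
  }
}

# One row per speech <div>, labelled with its language
speech <- data.frame(lang = lang[!is_marker], text = text[!is_marker],
                     stringsAsFactors = FALSE)

# Crude whitespace-based word count; tm could replace this step
speech$words <- sapply(strsplit(speech$text, "[[:space:]]+"), length)
counts <- tapply(speech$words, speech$lang, sum)
counts
```

#Walking the nodes in document order avoids having to express "everything
#between two markers" in a single XPath expression, which is the part I could
#not work out.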
#sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/en_US.UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scrapeR_0.1.6 RCurl_1.91-1 bitops_1.0-4.1 XML_3.9-4
loaded via a namespace (and not attached):
[1] tools_2.15.0
*********************************
Simon J. Kiss, PhD
Assistant Professor, Wilfrid Laurier University
73 George Street
Brantford, Ontario, Canada
N3T 2C9
Cell: +1 905 746 7606
