Simon Kiss
2012-May-11 17:14 UTC
[R] Using xpathapply or getnodeset to get text between two distinct tags
Hello: The following code extracts the links to the daily transcripts of Canada's House of Commons. 'links' is a matrix of URLs (ncol=1), each of which points to one day's transcripts. If you inspect the source of scrape(links[1]), you will find that periodically an italics tag appears right after a paragraph tag: <p some text><i>Translation</i></p>. At this point, the speaker is speaking French. Then there are some <div> tags that contain some text, and then, after the speaker has returned to English, you get the same pattern as above: <p some text><i>English</i></p><div>some speech</div><div>some speech</div>

Ultimately, what I'd like to do is count the words between the <i> tags 'Translation' and 'English'. I'm pretty sure I can get the text into the tm package to do the word counts; what I really don't know how to do is return the text between 'Translation' and 'English' so that I can mark it as 'French', and then return the text between 'English' and 'Translation' and mark it as 'English'. Does anyone have any suggestions?

Yours truly, Simon J.
Kiss

#Necessary libraries
library(XML)
library(scrapeR)

#URL for links to 2012 transcripts
hansard<-c('http://www.parl.gc.ca/housechamberbusiness/ChamberSittings.aspx?View=H&Language=E&Mode=1&Parl=41&Ses=1')

#Scrape the page with the links
doc<-scrape(url=hansard, parse=TRUE, follow=TRUE)

#Not sure what exactly this does, but it is necessary
doc<-doc[[1]]

#Get the xmlRoot directory
doc<-xmlRoot(doc)

#Get nodes that contain only the links to each day's transcripts
links<-getNodeSet(doc, "//a[@class='PublicationCalendarLink']/@href")
links<-matrix(links)

#Paste those href links to the root URL
links<-apply(links, 1, function(x) paste('http://www.parl.gc.ca', x, sep=''))

#Inspect
links[1]

#Scrape text from first URL in 'links'
oneday<-scrape(links[1])[[1]]

#Return p/i elements from 'oneday'
getNodeSet(oneday, "//p/i")

#sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] scrapeR_0.1.6 RCurl_1.91-1 bitops_1.0-4.1 XML_3.9-4

loaded via a namespace (and not attached):
[1] tools_2.15.0

*********************************
Simon J. Kiss, PhD
Assistant Professor, Wilfrid Laurier University
73 George Street
Brantford, Ontario, Canada N3T 2C9
Cell: +1 905 746 7606
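
One way the labelling step described above might be approached is sketched here. This is only an illustration under stated assumptions, not a tested solution for the real Hansard pages: the HTML string is a made-up miniature of the markup pattern described in the post, the <div> blocks are assumed not to be nested inside one another, and the names html, nodes, speech and counts are hypothetical. The idea is to select the marker <i> elements and the <div> elements with a single XPath expression, so they come back in document order, and then carry the most recent marker forward onto each <div>.

```r
library(XML)

# Made-up miniature of the pattern described in the post; the real page
# markup may differ (extra attributes, nested <div>s, other <i> tags).
html <- '<html><body>
<p class="x"><i>English</i></p>
<div>Mr. Speaker, I rise today.</div>
<p class="x"><i>Translation</i></p>
<div>Je vous remercie beaucoup.</div>
<div>Nous allons continuer demain.</div>
<p class="x"><i>English</i></p>
<div>Back to English now.</div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# One location path, so nodes come back in document order:
# every <div>, plus every <p>/<i> whose text is a language marker.
xp <- paste0(
  "//*[self::div or (self::i and parent::p and ",
  "(normalize-space(.)='Translation' or normalize-space(.)='English'))]"
)
nodes <- getNodeSet(doc, xp)

is_marker <- sapply(nodes, xmlName) == "i"
text <- gsub("^\\s+|\\s+$", "", sapply(nodes, xmlValue))

# 'Translation' marks the start of French; 'English' marks English.
marker_lang <- ifelse(text[is_marker] == "Translation", "French", "English")

# Carry the most recent marker forward; text before the first marker gets NA.
idx <- cumsum(is_marker)
lang <- c(NA, marker_lang)[idx + 1]

speech <- data.frame(
  language = lang[!is_marker],
  text = text[!is_marker],
  stringsAsFactors = FALSE
)

# Crude whitespace-based word count per block, then summed by language.
speech$words <- sapply(strsplit(speech$text, "\\s+"), length)
counts <- tapply(speech$words, speech$language, sum)
print(speech)
print(counts)
```

The speech$text column, split by speech$language, could then be handed to tm for a more careful word count. On the real pages, htmlParse(html, asText = TRUE) would presumably be replaced by the already parsed 'oneday' object, and the XPath expression might need adjusting to the actual markup.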