Moser, Gary
2011-Oct-26 23:36 UTC
[R] Webscraping - How to Scrape Out Text Into R As If Copied & Pasted From Webpage?
Greetings,

I am trying to get all of the text from a web page as if I "selected all" on the page, pasted into a text file, and then read in the text file with read.csv().

# this is the actual page I'm trying to acquire text from:
web.pg <- readLines("http://www.airweb.org/?page=574")

# then parsed in hopes of an easier structure to work with:
web.pg <- htmlTreeParse(file=web.pg, ignoreBlanks=TRUE)

Now I have a lovely HTML tree, but I don't know the best way to get just the text components (job descriptions, job titles, etc.) as they appear on the web site. I'd like to do a little text mining and make a wordcloud using the text. Can anybody suggest a method to achieve this result?

Thank you,

Gary R. Moser
Institutional Research Analyst
Heald College
Henrique Dallazuanna
2011-Oct-27 00:04 UTC
[R] Webscraping - How to Scrape Out Text Into R As If Copied & Pasted From Webpage?
Use an XPath query:

web.pg <- htmlTreeParse(file=web.pg, ignoreBlanks=TRUE, useInternalNodes = TRUE)

# Job title
xpathApply(web.pg, "//span[@class='normal']//b", xmlValue)

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
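
The same XPath approach can be extended from job titles to all of the visible text, which is closer to the "select all, copy, paste" result asked for in the original question. The following is only a minimal sketch under stated assumptions: the inline HTML string is a made-up stand-in for the real page (its markup is hypothetical, apart from the span/b pattern used in the reply), and the names html, doc, txt, page.text and job.titles are illustrative.

```r
library(XML)

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html <- '<html><head><title>Jobs</title><style>b {color: red;}</style></head>
<body><span class="normal"><b>Research Analyst</b></span>
<p>Analyze enrollment data.</p>
<script>var x = 1;</script></body></html>'

# Parse into an internal document so that XPath queries can be run on it.
doc <- htmlParse(html, asText = TRUE)

# Collect every text node under <body>, skipping <script> and <style> contents.
txt <- xpathSApply(
  doc,
  "//body//text()[not(ancestor::script)][not(ancestor::style)]",
  xmlValue
)

# Trim surrounding whitespace and drop fragments that are empty afterwards.
txt <- trimws(txt)
txt <- txt[nzchar(txt)]

# One string holding the page text, one fragment per line.
page.text <- paste(txt, collapse = "\n")

# The job-title query from the reply, run against the same document.
job.titles <- xpathSApply(doc, "//span[@class='normal']//b", xmlValue)
```

For the real page, the parsed document from the reply (web.pg with useInternalNodes = TRUE) would take the place of doc, and page.text could then be passed on to whatever text-mining or wordcloud tools are being used.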