I would like to be able to submit a list of URLs of various webpages and extract the "content", i.e. not the mark-up, of those pages. I can find plenty of examples in the XML library of extracting links from pages, but I cannot seem to find a way to extract the text. I will not know the structure of the URLs I would submit in advance. Any suggestions on where to look would be greatly appreciated.

Mike

W. Michael Conklin
Chief Methodologist
MarketTools, Inc. | www.markettools.com
If you only need to grab the text, it can be conveniently done with lynx. This example is for Windows, but it's nearly the same on other platforms:

  > out <- shell("lynx.bat --dump --nolist http://www.google.com", intern = TRUE)
  > head(out)
  [1] ""
  [2] " Web Images Videos Maps News Books Gmail more »"
  [3] " iGoogle | Search settings | Sign in"
  [4] " "
  [5] " Google"
  [6] " "
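A minimal sketch of extending the same lynx idea to a whole vector of URLs, assuming lynx is installed and on the PATH; on Windows, shell() takes the place of system() as above. The URLs below are just placeholders.

  ## Dump the rendered text of several pages with lynx (hypothetical example).
  urls <- c("http://www.r-project.org/", "http://www.omegahat.org/")  # placeholder URLs
  pages <- lapply(urls, function(u)
      system(paste("lynx --dump --nolist", shQuote(u)), intern = TRUE))
  names(pages) <- urls
  ## Each element of 'pages' is a character vector, one line of rendered page text per element.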
What kind of "content" are you after? Tables? Chunks of text?

For tables you can use the readHTMLTable() function in the XML package. There was also some discussion of alternate ways to extract data from tables in this thread:

http://n4.nabble.com/Downloading-data-from-from-internet-td889838.html#a889845

If you're after text, then it's probably a matter of locating the element that encloses the data you want, perhaps by using getNodeSet() along with an XPath expression [1] that specifies the element you are interested in. The text can then be recovered using the xmlValue() function.

Hope this helps!

-Charlie

[1]: http://www.w3schools.com/XPath/xpath_syntax.asp
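A minimal sketch of both suggestions, assuming the XML package is installed; the URL and the XPath expression are placeholders you would replace for your own pages.

  library(XML)

  doc <- htmlParse("http://www.r-project.org/")          # placeholder URL

  ## Tables: readHTMLTable() returns a list of data frames, one per <table>.
  tables <- readHTMLTable(doc)

  ## Text: locate the enclosing element(s) with XPath, then pull out their text.
  nodes <- getNodeSet(doc, "//div[@class = 'content']")  # placeholder XPath
  txt <- sapply(nodes, xmlValue)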
Hi Michael,

If you just want all of the text that is displayed in the HTML document, then you might use an XPath expression to get all the text() nodes and get their values. An example is

  library(XML)  # htmlParse() and xpathSApply() are in the XML package
  doc = htmlParse("http://www.omegahat.org/")
  txt = xpathSApply(doc, "//body//text()", xmlValue)

The result is a character vector that contains all the text. By limiting the nodes to the body, we avoid the content in <head> such as inlined JavaScript or CSS.

It is also possible that a document may have <script> elements in the body containing JavaScript that you don't want. You can omit these with

  txt = xpathSApply(doc, "//body//text()[not(ancestor::script)]", xmlValue)

And if there were other elements you wanted to ignore, then you could use

  txt = xpathSApply(doc, "//body//text()[not(ancestor::script) and not(ancestor::otherElement)]", xmlValue)

HTH,

D.
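A minimal sketch of wrapping the approach above so it can be applied to a list of URLs, as the original question asked; the helper name and the URLs are just placeholders.

  library(XML)

  ## Hypothetical helper: return the visible text of one page as a single string.
  extract_text <- function(url) {
    doc <- htmlParse(url)
    txt <- xpathSApply(doc, "//body//text()[not(ancestor::script)]", xmlValue)
    paste(txt, collapse = " ")
  }

  urls <- c("http://www.omegahat.org/", "http://www.r-project.org/")  # placeholder URLs
  contents <- sapply(urls, extract_text)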