I would like to be able to submit a list of URLs of various webpages and extract the "content", i.e. not the mark-up, of those pages. I can find plenty of examples in the XML library of extracting links from pages, but I cannot seem to find a way to extract the text. I will not know the structure of the URLs I would submit in advance. Any suggestions on where to look would be greatly appreciated.

Mike

W. Michael Conklin
Chief Methodologist
MarketTools, Inc. | www.markettools.com
If you only need to grab the text, it can be conveniently done with lynx. This example is for Windows, but it's nearly the same on other platforms:

  > out <- shell("lynx.bat --dump --nolist http://www.google.com", intern = TRUE)
  > head(out)
  [1] ""
  [2] " Web Images Videos Maps News Books Gmail more »"
  [3] " iGoogle | Search settings | Sign in"
  [4] " "
  [5] " Google"
  [6] " "
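A minimal sketch of extending the same lynx idea to a whole vector of URLs, assuming lynx is installed and on the PATH; on Windows, shell() takes the place of system() as above. The URLs below are just placeholders.

  ## Dump the rendered text of several pages with lynx (hypothetical example).
  urls <- c("http://www.r-project.org/", "http://www.omegahat.org/")  # placeholder URLs
  pages <- lapply(urls, function(u)
      system(paste("lynx --dump --nolist", shQuote(u)), intern = TRUE))
  names(pages) <- urls
  ## Each element of 'pages' is a character vector, one line of rendered page text per element.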
What kind of "content" are you after? Tables? Chunks of text?

For tables you can use the readHTMLTable() function in the XML package. There was also some discussion of alternate ways to extract data from tables in this thread:

http://n4.nabble.com/Downloading-data-from-from-internet-td889838.html#a889845

If you're after text, then it's probably a matter of locating the element that encloses the data you want, perhaps by using getNodeSet() along with an XPath expression [1] that specifies the element you are interested in. The text can then be recovered using the xmlValue() function.

Hope this helps!

-Charlie

[1]: http://www.w3schools.com/XPath/xpath_syntax.asp
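A minimal sketch of both suggestions, assuming the XML package is installed; the URL and the XPath expression are placeholders you would replace for your own pages.

  library(XML)

  doc <- htmlParse("http://www.r-project.org/")          # placeholder URL

  ## Tables: readHTMLTable() returns a list of data frames, one per <table>.
  tables <- readHTMLTable(doc)

  ## Text: locate the enclosing element(s) with XPath, then pull out their text.
  nodes <- getNodeSet(doc, "//div[@class = 'content']")  # placeholder XPath
  txt <- sapply(nodes, xmlValue)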
Hi Michael,

If you just want all of the text that is displayed in the HTML document, then you might use an XPath expression to get all the text() nodes and get their values. An example is

  library(XML)  # htmlParse() and xpathSApply() are in the XML package
  doc = htmlParse("http://www.omegahat.org/")
  txt = xpathSApply(doc, "//body//text()", xmlValue)

The result is a character vector that contains all the text. By limiting the nodes to the body, we avoid the content in <head> such as inlined JavaScript or CSS.

It is also possible that a document may have <script> elements in the body containing JavaScript that you don't want. You can omit these with

  txt = xpathSApply(doc, "//body//text()[not(ancestor::script)]", xmlValue)

And if there were other elements you wanted to ignore, then you could use

  txt = xpathSApply(doc, "//body//text()[not(ancestor::script) and not(ancestor::otherElement)]", xmlValue)

HTH,

D.
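A minimal sketch of wrapping the approach above so it can be applied to a list of URLs, as the original question asked; the helper name and the URLs are just placeholders.

  library(XML)

  ## Hypothetical helper: return the visible text of one page as a single string.
  extract_text <- function(url) {
    doc <- htmlParse(url)
    txt <- xpathSApply(doc, "//body//text()[not(ancestor::script)]", xmlValue)
    paste(txt, collapse = " ")
  }

  urls <- c("http://www.omegahat.org/", "http://www.r-project.org/")  # placeholder URLs
  contents <- sapply(urls, extract_text)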