Folks,

I want to scrape a series of web-page sources for strings like the following:

"/en/Ships/A-8605507.html"
"/en/Ships/Aalborg-8122830.html"

which appear in an href inside an <a> tag inside a <div> tag inside a table.

In fact all I want is the (exactly) 7-digit number before ".html".

The good news is that, as far as I can tell, the <a> tag is always on its own line, so some kind of line-by-line grep should suffice once I figure out the following:

What is the best package/command to use to get the source of a web page? I tried using something like:

if (url.exists("http://www.omegahat.org/RCurl")) {
  h = basicTextGatherer()
  curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
  # Now read the text that was accumulated during the query response.
  h$value()
}

which works except that I get one long streamed HTML document without the line breaks.

Thanks in advance for your help,
KW
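For what it's worth, the single string that h$value() returns can be split back into lines before grepping; a minimal sketch, assuming the RCurl snippet above:

library(RCurl)

h <- basicTextGatherer()
curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)

# h$value() gives the whole document as one string; split it on newlines to get
# a one-line-per-element character vector suitable for grep().
doc <- h$value()
lines <- strsplit(doc, "\n", fixed = TRUE)[[1]]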
On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub <kw1958 at gmail.com> wrote:
> Folks,
>
> I want to scrape a series of web-page sources for strings like the following:
>
> "/en/Ships/A-8605507.html"
> "/en/Ships/Aalborg-8122830.html"
>
> which appear in an href inside an <a> tag inside a <div> tag inside a table.
>
> In fact all I want is the (exactly) 7-digit number before ".html".
>
> The good news is that, as far as I can tell, the <a> tag is always on its own line, so some kind of line-by-line grep should suffice once I figure out the following:
>
> What is the best package/command to use to get the source of a web page? I tried using something like:
>
> if (url.exists("http://www.omegahat.org/RCurl")) {
>   h = basicTextGatherer()
>   curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
>   # Now read the text that was accumulated during the query response.
>   h$value()
> }
>
> which works except that I get one long streamed HTML document without the line breaks.

You could use:

h <- readLines("http://www.omegahat.org/RCurl")

-- or --

download.file(url = "http://www.omegahat.org/RCurl", destfile = "tmp.html")
h = scan("tmp.html", what = "", sep = "\n")

and then use grep or the XML package for processing.

HTH

James
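A sketch of the grep step James mentions (the pattern is mine, not from the thread), assuming h holds the page source one line per element:

# Lines containing a ship href such as "/en/Ships/Aalborg-8122830.html"
hits <- grep('/en/Ships/[^"]*-[0-9]{7}\\.html', h, value = TRUE)

# Keep just the 7-digit number before ".html"
ids <- sub('.*-([0-9]{7})\\.html.*', '\\1', hits)
ids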
Thanks Gabor,

Nifty regexp. I never used strapplyc before and I am sure this will become a nice addition to my toolkit.

KW


Message: 5
Date: Tue, 15 May 2012 07:55:33 -0400
From: Gabor Grothendieck <ggrothendieck@gmail.com>
To: Keith Weintraub <kw1958@gmail.com>
Cc: r-help@r-project.org
Subject: Re: [R] Scraping a web page.
Message-ID: <CAP01uR=zdxHocxpsZdpT+4Kx2=L2vr9jnr=i=_Qhs39O=QoThg@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Tue, May 15, 2012 at 7:06 AM, Keith Weintraub <kw1958@gmail.com> wrote:
> Thanks,
> That was very helpful.
>
> I am using readLines and grep. If grep isn't powerful enough I might end up using the XML package, but I hope that won't be necessary.

This only uses readLines and strapplyc (from gsubfn). It scrapes the relevant strings from your post on Nabble, and by modifying URL and pat you can likely get it to work with whatever the format of your original files is:

library(gsubfn)

URL <- "http://r.789695.n4.nabble.com/Scraping-a-web-page-tp4630005.html"
L <- readLines(URL)
pat <- '<br/>"/en/Ships.*-(\\d{7}).html"'
strapplyc(L, pat, simplify = c)

The result from the last line is:

[1] "8605507" "8122830"

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
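For reference, roughly the same extraction can be done in base R with gregexpr() and regmatches(), without the gsubfn dependency; a sketch reusing Gabor's L, where the pattern is an assumption about the page layout, just as his is:

# Pull out every "/en/Ships/...-NNNNNNN.html" substring, then keep the digits.
m <- unlist(regmatches(L, gregexpr('/en/Ships/[^"]*-[0-9]{7}\\.html', L)))
ids <- sub('.*-([0-9]{7})\\.html$', '\\1', m)
ids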
Duncan,

Thanks for the advice. It turns out that the web pages are pretty well behaved.

I ended up using:

readHTMLTable
str_select
grep
gsub
readLines

When I have time I am going to convert my code to use the HTML parser and the more robust getNodeSet method that you mention below.

Thanks for your detailed reply,
KW


Message: 139
Date: Tue, 15 May 2012 21:02:05 -0700
From: Duncan Temple Lang <duncan@wald.ucdavis.edu>
To: r-help@r-project.org
Subject: Re: [R] Scraping a web page.
Message-ID: <4FB326BD.9080207@wald.ucdavis.edu>
Content-Type: text/plain; charset=ISO-8859-1

Hi Keith

Of course, it doesn't necessarily matter how you get the job done if it actually works correctly. But for a general approach, it is useful to use general tools, which can lead to more correct, more robust, and more maintainable code.

Since htmlParse() in the XML package can both retrieve and parse the HTML document,

doc = htmlParse(the.url)

is much more succinct than using curlPerform(). However, if you want to use RCurl, just use

txt = getURLContent(the.url)

and that replaces

h = basicTextGatherer()
curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
h$value()

If you have parsed the HTML document, you can find the <a> nodes that have an href attribute starting with /en/Ships via

hrefs = unlist(getNodeSet(doc, "//a[starts-with(@href, '/en/Ships')]/@href"))

The result is a character vector, and you can extract the relevant substrings with substring() or gsub() or any wrapper of those functions.

There are many benefits to parsing the HTML, including not falling foul of "as far as I can tell the <a> tag is always on its own line" being not true.

D.

On 5/15/12 4:06 AM, Keith Weintraub wrote:
> Thanks,
> That was very helpful.
>
> I am using readLines and grep. If grep isn't powerful enough I might end up using the XML package, but I hope that won't be necessary.
>
> Thanks again,
> KW
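Putting Duncan's pieces together, a minimal end-to-end sketch with the XML package; the.url here is a placeholder, not one of the actual ship pages:

library(XML)

the.url <- "http://www.example.com/ships.html"   # placeholder URL, assumption

# Fetch and parse the page in one step.
doc <- htmlParse(the.url)

# Every href on an <a> node that starts with /en/Ships, as in Duncan's XPath.
hrefs <- unlist(getNodeSet(doc, "//a[starts-with(@href, '/en/Ships')]/@href"))

# Keep just the 7-digit number before ".html".
ids <- gsub('.*-([0-9]{7})\\.html$', '\\1', hrefs)
ids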