Hi,

I wish to develop a web crawler in R. I have been using the functionality available in the RCurl package. I am able to extract the HTML content of the site, but I don't know how to go about analyzing the HTML-formatted document. I wish to know the frequency of a word in the document. I am only acquainted with analyzing data sets, so how should I go about analyzing data that is not available in table format?

A few chunks of code that I wrote:

    w <- getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
    write.table(w, "test.txt")
    t <- readLines(w)

readLines also didn't prove to be of any help.

Any help would be highly appreciated. Thanks in advance.
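Note that getURL() returns the page source as a single character string, so readLines(w) tries to treat the HTML itself as a file name. A minimal sketch of reading the fetched page line by line, assuming the fetch succeeds:

    library(RCurl)

    page_url <- "http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes"
    w <- getURL(page_url)    # w holds the whole page as one string

    # wrap the string in a text connection before calling readLines()
    con <- textConnection(w)
    html_lines <- readLines(con)
    close(con)

    # alternatively, base R can read the URL directly, no RCurl needed
    html_lines2 <- readLines(page_url)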
Perl seems like a 10x better choice for the task, but try looking at the examples in ?strsplit to get started.
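A minimal sketch of that strsplit() route, assuming the page is already in w from getURL() above and taking "kindle" as the word of interest; it is crude in that it tokenizes the raw HTML rather than the rendered text:

    # lower-case the page and split on runs of non-alphanumeric characters
    words <- unlist(strsplit(tolower(w), "[^[:alnum:]]+"))
    freq  <- table(words)

    freq["kindle"]                       # frequency of one word
    head(sort(freq, decreasing = TRUE))  # the most frequent tokens overall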
Mike Marchywka
2011-Mar-03 14:07 UTC
[R] Developing a web crawler / R "webkit" or something similar?
> I wish to develop a web crawler in R. I have been using the
> functionalities available under the RCurl package.
> I am able to extract the html content of the site but i don't know how
> to go about analyzing the html formatted document.

In general this can be a big effort, but there may be things in text processing packages you could adapt to execute html and javascript. However, I guess what I'd be looking for is something like a "webkit" package or other open source browser, with or without an "R" interface. This actually may be an ideal solution for a lot of things, as you get all the content handlers of at least some browser.

Now that you mention it, I wonder if there are browser plugins to handle "R" content. (I'd have to give this some thought: put a script up as a web page with mime type "text/R" and have the browser execute it in R.)
On Mar 3, 2011, at 4:22 AM, antujsrv wrote:

> I wish to develop a web crawler in R.

As Rex said, there are faster languages, but R string processing got better thanks to the stringr package (R Journal 2010-2). When Hadley is done with it, it will be like having it all in R!

-- Alexy
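By way of illustration, a small sketch with stringr for the word-frequency question (the word "kindle" is just a placeholder, and w is again the page fetched with getURL()):

    library(stringr)

    # count whole-word matches in the lower-cased page string
    str_count(tolower(w), "\\bkindle\\b")

    # or tokenize and tabulate, much as with strsplit
    tokens <- unlist(str_split(tolower(w), "[^[:alnum:]]+"))
    head(sort(table(tokens), decreasing = TRUE))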
Hi

The book whose companion website is here
<http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/qclwr.html>
deals with many of the things you need for a web crawler, and assignment "other 5" on that site
(<http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/other_5.pdf>)
is a web crawler.

Best,
STG

--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
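For a rough idea of the shape such a crawler takes, here is an illustrative sketch built on RCurl (not the assignment's code); the regex-based link extraction is deliberately naive, and a real crawler would parse the HTML properly:

    library(RCurl)

    # breadth-first crawl of up to `limit` pages starting from `seed`
    crawl <- function(seed, limit = 10) {
      queue <- seed
      seen  <- character(0)
      while (length(queue) > 0 && length(seen) < limit) {
        u     <- queue[1]
        queue <- queue[-1]
        if (u %in% seen) next
        page <- tryCatch(getURL(u), error = function(e) "")
        seen <- c(seen, u)
        # naive extraction of absolute links from href attributes
        hits  <- regmatches(page, gregexpr("href=\"http[^\"]+\"", page))
        links <- gsub("^href=\"|\"$", "", unlist(hits))
        queue <- c(queue, links)
      }
      seen
    }

    visited <- crawl("http://www.r-project.org/", limit = 5)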
Can I ask a question: do I need to be good at math to develop a web crawler? (I want to develop a simple web crawler to do something.)
Hi Stefan,

Thanks for the links you shared in the post, but I am unable to access the scripts and output; they require a password. If you could let me know the password for the .rar file of "scripts_other 5", it would be really helpful.

Thanks in advance.