Hi,

I wish to develop a web crawler in R. I have been using the functionality available in the RCurl package. I am able to extract the HTML content of the site, but I don't know how to go about analyzing the HTML-formatted document. I wish to know the frequency of a word in the document. I am only acquainted with analyzing data sets, so how should I go about analyzing data that is not available in table format?

A few chunks of code that I wrote:

    w <- getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
    write.table(w, "test.txt")
    t <- readLines(w)

readLines also didn't prove to be of any help.

Any help would be highly appreciated. Thanks in advance.
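Note that getURL() returns the page source as a single character string, so readLines(w) tries to treat the HTML itself as a file name. A minimal sketch of reading the fetched page line by line, assuming the fetch succeeds:

    library(RCurl)

    page_url <- "http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes"
    w <- getURL(page_url)    # w holds the whole page as one string

    # wrap the string in a text connection before calling readLines()
    con <- textConnection(w)
    html_lines <- readLines(con)
    close(con)

    # alternatively, base R can read the URL directly, no RCurl needed
    html_lines2 <- readLines(page_url)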
Perl seems like a 10x better choice for the task, but try looking at the examples in ?strsplit to get started.
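A minimal sketch of that strsplit() route, assuming the page is already in w from getURL() above and taking "kindle" as the word of interest; it is crude in that it tokenizes the raw HTML rather than the rendered text:

    # lower-case the page and split on runs of non-alphanumeric characters
    words <- unlist(strsplit(tolower(w), "[^[:alnum:]]+"))
    freq  <- table(words)

    freq["kindle"]                       # frequency of one word
    head(sort(freq, decreasing = TRUE))  # the most frequent tokens overall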
Mike Marchywka
2011-Mar-03 14:07 UTC
[R] Developing a web crawler / R "webkit" or something similar?
> I wish to develop a web crawler in R. I have been using the
> functionalities available under the RCurl package.
> I am able to extract the html content of the site but i don't know how
> to go about analyzing the html formatted document.

In general this can be a big effort, but there may be things in text processing packages you could adapt to execute html and javascript. However, I guess what I'd be looking for is something like a "webkit" package or other open source browser, with or without an "R" interface. This actually may be an ideal solution for a lot of things, as you get all the content handlers of at least some browser.

Now that you mention it, I wonder if there are browser plugins to handle "R" content. (I'd have to give this some thought: put a script up as a web page with mime type "text/R" and have the browser execute it in R.)
On Mar 3, 2011, at 4:22 AM, antujsrv wrote:

> I wish to develop a web crawler in R.

As Rex said, there are faster languages, but R string processing got better thanks to the stringr package (R Journal 2010-2). When Hadley is done with it, it will be like having it all in R!

-- Alexy
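By way of illustration, a small sketch with stringr for the word-frequency question (the word "kindle" is just a placeholder, and w is again the page fetched with getURL()):

    library(stringr)

    # count whole-word matches in the lower-cased page string
    str_count(tolower(w), "\\bkindle\\b")

    # or tokenize and tabulate, much as with strsplit
    tokens <- unlist(str_split(tolower(w), "[^[:alnum:]]+"))
    head(sort(table(tokens), decreasing = TRUE))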
Hi

The book whose companion website is here
<http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/qclwr.html>
deals with many of the things you need for a web crawler, and assignment "other 5" on that site
(<http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/other_5.pdf>)
is a web crawler.

Best,
STG

--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
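For a rough idea of the shape such a crawler takes, here is an illustrative sketch built on RCurl (not the assignment's code); the regex-based link extraction is deliberately naive, and a real crawler would parse the HTML properly:

    library(RCurl)

    # breadth-first crawl of up to `limit` pages starting from `seed`
    crawl <- function(seed, limit = 10) {
      queue <- seed
      seen  <- character(0)
      while (length(queue) > 0 && length(seen) < limit) {
        u     <- queue[1]
        queue <- queue[-1]
        if (u %in% seen) next
        page <- tryCatch(getURL(u), error = function(e) "")
        seen <- c(seen, u)
        # naive extraction of absolute links from href attributes
        hits  <- regmatches(page, gregexpr("href=\"http[^\"]+\"", page))
        links <- gsub("^href=\"|\"$", "", unlist(hits))
        queue <- c(queue, links)
      }
      seen
    }

    visited <- crawl("http://www.r-project.org/", limit = 5)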
Can I ask a question: do I need to be good at math to develop a web crawler? (I want to develop a simple web crawler to do something.)
Hi Stefan,

Thanks for the links you shared in the post, but I am unable to access the scripts and output; they require a password. If you could let me know the password for the .rar file of "scripts_other 5", it would be really helpful.

Thanks in advance.