Hi all,

I'm new to Rails and want to break into it with a particular project, but I need a few pointers on how to begin. Just some specific websites that can help would be very good.

I have a test website with dreamhost.com, so my project is entirely webhost-based at the moment, and I connect with ssh and ftp.

I'd like to provide another point of access for a Mailman mailing list whose archives are located on another website. The mailing list archive is currently quite naff looking and hard to navigate. I'd like to build a Rails app that will effectively mirror the content but make it easier to read threaded discussion, post to the list, etc., plus various other tricks that will improve the enjoyment factor. (No "what's the point of that" comments please :) It's as much for a project to do as anything else.)

The part I'm stuck on is how to get Ruby to crawl this other site and index the posts of the mailing list archive, since I don't have access to the actual mailing list logs. My newbie head is thinking a good way would be for a cron job to execute a script that crawls the site and adds new posts to its own database every day or so, and for Rails to use that database. Any other ideas are welcome.

So I've searched for information on how to make Rails read content from another website but haven't had any luck, and being new to Ruby as well I'm not even sure which keywords to use when searching. I've found File.open() and net/http, but nothing yet that shows whether these can help you read from another website.

best
luke
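For the crawling question above, a minimal sketch of how the standard-library net/http could be used to read another site and find message pages. The method names (fetch_page, message_urls, new_urls) and the assumption that the archive stores each post as a numbered file such as 000123.html are illustrative only, not something stated in the thread.

```ruby
require 'net/http'
require 'uri'

# Download one page and return its body as a String.
# Raises if the server does not answer with a 2xx status.
def fetch_page(url)
  response = Net::HTTP.get_response(URI.parse(url))
  raise "GET #{url} failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response.body
end

# Collect links to numbered message pages (e.g. "000123.html") from an
# index page and turn them into absolute URLs.
# ASSUMPTION: the numbered file-name pattern is a guess about the archive layout.
def message_urls(index_html, index_url)
  index_html.scan(/href="(\d+\.html)"/i).flatten.uniq.map do |href|
    URI.join(index_url, href).to_s
  end
end

# Keep only URLs that are not stored yet, so a daily cron run
# fetches nothing but new posts.
def new_urls(urls, already_seen)
  urls - already_seen
end
```

A cron-driven script could call fetch_page on the index, pass the result to message_urls, filter with new_urls against the URLs already in the database, and then fetch and store only what is left.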
On Jun 29, 2005, at 10:57 PM, luke wrote:
> <snip>
>
> best
> luke
>
> _______________________________________________
> Rails mailing list
> Rails-1W37MKcQCpIf0INCOvqR/iCwEArCW2h5@public.gmane.org
> http://lists.rubyonrails.org/mailman/listinfo/rails

Hey Luke-

I am doing something kind of similar. I have to fetch the contents of a few pages from another server that has a proprietary front end to our news database, and it only runs on Mac OS 9. So there is no way for me to run it on my Linux servers where my Rails app lives. So I have the following model in my Rails app:

    require 'net/http'

    class Page < ActiveRecord::Base
      # Fetch a page from the other server and rewrite its links.
      def self.fetch(page)
        Net::HTTP.start("192.168.0.2") do |http|
          body = http.get("/#{page}").body
          body.gsub!(/\/temporaryimages/, "http://192.168.0.2/temporaryimages")
          body.gsub!(/\/wrappers\/(\d+)\.news/i, 'display/\1')
          body.gsub!(/\/premium\/(\d+)\.news/i, 'premium/\1')
          body  # gsub! returns nil when nothing matched, so return the body explicitly
        end
      end
    end

And then I can call this in my controller:

    @content = Page.fetch("index.php")

And then in my view:

    <%= @content %>

Basically I am pulling the body of a remote HTML file from the other server and then doing some gsub! replacement and re-arranging of the HTML I get back.

If you control the other website, maybe you could put some custom HTML comments wrapping each post to your list, like this:

    <!--begin-->
    Mailing list post goes here..
    <!--end-->

Then you could pull from the site one page at a time and run through the fetched content with a regex in a loop that pulls each post between the custom tags and puts it in an array. Then you could load that array into Active Record and do whatever you want with it in Rails.

Of course there is more overhead when you have to render content from a remote webserver, so caching is very important.

Hope that helps. If you decide to go this route, let me know if you need any more help.

Cheers-
-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
ezra-gdxLOakOTQ9oetBuM9ipNAC/G2K4zDHf@public.gmane.org
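The marker idea above could look roughly like the sketch below. It only applies if the <!--begin--> and <!--end--> comments can actually be added to the remote pages; the constant POST_PATTERN and the method extract_posts are made-up names for illustration.

```ruby
# Matches everything between the custom markers. The /m flag lets "."
# span line breaks, and ".*?" is non-greedy so each post is matched separately.
POST_PATTERN = /<!--begin-->(.*?)<!--end-->/m

# Return an array with the text of every post found in the fetched HTML.
def extract_posts(html)
  html.scan(POST_PATTERN).map { |match| match.first.strip }
end
```

Each element of the returned array could then be saved through whatever model the app uses for posts, for example one record per element.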
In Perl, the WWW::Mechanize module is commonly used for page retrieval and link traversal. Someone has attempted a Ruby port, though I haven't tried it yet:

http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
Thank you so much, both Ezra and Marcus.

Knowing the terms Net::HTTP and gsub, I googled and found this, which looks like a promising guide as well:

www.linux-magazine.com/issue/51/Ruby_Web_Spiders.pdf

So now to work!

Thanks
Luke
Hi Luke,

I did something pretty similar for my railsday entry. You might want to have a look at my code at http://railsday.com/svn/railsday6/.

I used RMail to import the mbox archives that most Mailman archives provide, e.g.:

    # Runs inside a List instance, so self is the list being imported.
    List.transaction do
      # parse_mbox yields the raw text of each message in the mbox file
      RMail::Mailbox.parse_mbox(File.open(path)) do |raw|
        message = RMail::Parser.read(raw)
        # new_from_rmail belongs to the app's own Message model
        m = Message.new_from_rmail(message)
        m.list = self
        m.save
      end
    end

Hope this might save you some time,

Hadley
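To show what an mbox import has to do, and how the imported posts could feed a threaded view, here is a plain-Ruby sketch. It deliberately does not use RMail; it is a simplified stand-in that splits on the "From " envelope lines and reads only three headers. The method names split_mbox, parse_message and replies_by_parent are hypothetical, and a real import would want a proper mail parser for encodings, MIME parts and malformed headers.

```ruby
# Split raw mbox text into messages. Each message in an mbox file
# starts with an envelope line beginning "From ".
def split_mbox(text)
  text.split(/^(?=From )/).reject { |chunk| chunk.strip.empty? }
end

# Turn one raw message into a small hash with the fields needed for threading.
def parse_message(raw)
  header_text, body = raw.split(/\r?\n\r?\n/, 2)
  headers = {}
  # Unfold continuation lines (lines starting with whitespace continue the
  # previous header), then read "Name: value" pairs. The envelope line has
  # no "Name:" prefix, so it is skipped.
  header_text.gsub(/\r?\n[ \t]+/, ' ').each_line do |line|
    match = line.chomp.match(/\A([^\s:]+):\s*(.*)\z/)
    headers[match[1].downcase] = match[2] if match
  end
  { id: headers['message-id'], parent: headers['in-reply-to'],
    subject: headers['subject'], body: body.to_s }
end

# Group messages by the Message-ID they reply to.
# Messages that start a thread are grouped under the nil key.
def replies_by_parent(messages)
  messages.group_by { |message| message[:parent] }
end
```

With the grouping in hand, a view could start from the messages under the nil key and look up each message's :id to find its replies.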
Hi Hadley,

Now that's a great head start :). Very much what I had in mind. Thank you.