Hi,

I want to screen scrape information from some websites (I have permission to do it).

I am using the Mechanize plugin. The websites are different from each other, so I need to write different RoR code to screen scrape each website. There would be hundreds of different websites.

The problem is that I don't know how to implement this in an elegant and efficient way. My current quick and dirty solution is a model that I call when I want to screen scrape a website.

I call it like: Spider.crawl(website_id)

It looks like:

    require 'mechanize'

    class Spider < ActiveRecord::Base
      # Class method, so that it can be called as Spider.crawl(website_id).
      def self.crawl(website_id)
        if website_id == 1
          # Mechanize code for screen scraping website 1
        end

        if website_id == 2
          # Mechanize code for screen scraping website 2
        end

        # ... one branch per website
      end
    end

How can I improve that? Is there at least a way to put the code for each website in an external file, so that I can call just the code I need? That way I would avoid working with a model that has thousands of lines...

Thanks for your help!

-- 
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk@googlegroups.com.
To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en.
Here are my off-the-top-of-my-head suggestions:

Different thor scripts for each website, perhaps with a single script to call the rest of them.

I did something similar for scraping shopping cart information. Since I needed the same data on every page, I wrote a generic crawler which would read the XPath string from the database for each item I wanted to scrape. Worked well.
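A rough sketch of the data-driven approach described above, not the poster's actual code: each field's XPath expression is stored as data (here a plain Hash standing in for database rows) and one generic routine applies them all. It uses REXML on a small well-formed snippet so that it runs without extra gems; with Mechanize you would presumably fetch the page first and query it with `page.search(xpath)` instead. The names `SITE_RULES`, `scrape` and `SAMPLE_PAGE`, and the sample markup, are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical rules table: in a real app these rows would come from the
# database, one XPath string per field per website.
SITE_RULES = {
  1 => { title: "//h1", price: "//span[@class='price']" },
  2 => { title: "//div[@id='name']", price: "//td[@class='cost']" }
}.freeze

# Generic scraper: applies every stored XPath to the document and returns
# a Hash of field name => extracted text (nil when nothing matches).
def scrape(markup, rules)
  doc = REXML::Document.new(markup)
  rules.each_with_object({}) do |(field, xpath), result|
    node = REXML::XPath.first(doc, xpath)
    result[field] = node && node.text.to_s.strip
  end
end

# Sample well-formed markup standing in for a fetched page.
SAMPLE_PAGE = <<~HTML
  <html><body>
    <h1>Blue Widget</h1>
    <span class="price"> 9.99 </span>
  </body></html>
HTML

p scrape(SAMPLE_PAGE, SITE_RULES[1])
```

Adding a website then means adding a row of XPath strings rather than writing new code, which only works when every site exposes the same kind of data.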
On 12 July 2011 10:02, aupayo <cresteb-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> How can I improve that?
> Is there at least a way to put the code for each website in an
> external file, so then I can call just the code I need? That way I
> would avoid working with a model that has thousands of lines...

If you just want to split it up, then provide a set of models (not based on ActiveRecord), one for each site, and call the scrape method from your switch list (which would be better as a case statement). If you derive them all from a common base then you can put any common code in the base.

Colin
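One way this suggestion could look in plain Ruby, as a sketch rather than Colin's actual code: a common base class holds the shared behaviour, each site gets a small class of its own (which could live in its own file), and a case statement maps website_id to the right class. All class names and the file path in the comment are placeholders; the real Mechanize calls would go where the comments indicate.

```ruby
# Common base: anything shared by all scrapers lives here.
class BaseScraper
  # Shared entry point; subclasses only supply #scrape.
  def run
    "#{self.class.name}: #{scrape}"
  end

  def scrape
    raise NotImplementedError, "#{self.class} must implement #scrape"
  end
end

# One small class per website (hypothetical names). Each could sit in its
# own file, for example lib/scrapers/first_shop_scraper.rb.
class FirstShopScraper < BaseScraper
  def scrape
    # Mechanize code for website 1 would go here.
    "scraped website 1"
  end
end

class SecondShopScraper < BaseScraper
  def scrape
    # Mechanize code for website 2 would go here.
    "scraped website 2"
  end
end

class Spider
  # Picks the scraper class for a website id with a case statement.
  def self.crawl(website_id)
    scraper_class =
      case website_id
      when 1 then FirstShopScraper
      when 2 then SecondShopScraper
      else raise ArgumentError, "no scraper for website #{website_id}"
      end
    scraper_class.new.run
  end
end

puts Spider.crawl(1)
puts Spider.crawl(2)
```

The calling code stays `Spider.crawl(website_id)`, while the per-site logic moves out of the model.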
Conrad Taylor
2011-Jul-12 15:33 UTC
Re: different code for each record, how to implement??
On Tue, Jul 12, 2011 at 2:02 AM, aupayo <cresteb-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> How can I improve that?
> Is there at least a way to put the code for each website in an
> external file, so then I can call just the code I need? That way I
> would avoid working with a model that has thousands of lines...

Hi, you can define a base class which contains all the common information for all your sites. Then you can define a subclass for each site that inherits from the base class. For example,

    class Site
      attr_accessor :name

      # to_s should return a string rather than print one.
      def to_s
        "using #{self.class}#to_s"
      end

      def crawl
        puts "using #{self.class}#crawl"
      end
    end

    class HerSite < Site
      def crawl
        puts "using #{self.class}#crawl version 1"
      end
    end

    class HisSite < Site
      def crawl
        puts "using #{self.class}#crawl version 2"
      end
    end

Next, you can define a SiteFactory class for creating an instance of the given class which represents our site. Thus, this can be represented as follows:

    class SiteFactory
      def create(site)
        site.new
      end
    end

We can define our Spider class, which has a single class method that takes an instance of a site and invokes its crawl instance method.

    class Spider
      def self.crawl_site(site)
        site.crawl
      end
    end

Putting it all together, we can crawl all of our sites by doing the following:

    site_factory = SiteFactory.new

    [HerSite, HisSite].each do |klass|
      site = site_factory.create(klass)
      Spider.crawl_site(site)
    end

Finally, anytime you want to add a new site, you just create a class that inherits from class Site and has a single instance method called crawl that describes its strategy for navigating the site. There's an easier way to obtain all the classes that inherit from class Site, and I leave this as an exercise for you.

Good luck,

-Conrad
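Since the original question starts from a website_id, one possible bridge between a database record and these site classes is to store the class name on the record and resolve it with `Object.const_get`. This is only a sketch under stated assumptions: the `scraper_class_name` attribute, the `Website` struct (standing in for an ActiveRecord model) and the `site_for` helper are all hypothetical, and crawl returns a string here instead of printing so the result is easy to inspect.

```ruby
class Site
  def crawl
    "using #{self.class}#crawl"
  end
end

class HerSite < Site
  def crawl
    "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    "using #{self.class}#crawl version 2"
  end
end

# Stand-in for an ActiveRecord model; scraper_class_name is a hypothetical column.
Website = Struct.new(:id, :scraper_class_name)

# Resolves the class named on the record and refuses anything that is not
# a subclass of Site, so arbitrary constants cannot be instantiated.
def site_for(website)
  klass = Object.const_get(website.scraper_class_name)
  unless klass.is_a?(Class) && klass < Site
    raise ArgumentError, "#{website.scraper_class_name} is not a Site subclass"
  end
  klass.new
end

websites = [Website.new(1, "HerSite"), Website.new(2, "HisSite")]
websites.each { |website| puts site_for(website).crawl }
```

With this arrangement, adding a site means adding one class and one record, with no central if or case list to maintain.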
Conrad Taylor
2011-Jul-12 15:42 UTC
Re: different code for each record, how to implement??
On Tue, Jul 12, 2011 at 8:33 AM, Conrad Taylor <conradwt-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Next, you can define a SiteFactory class for creating an instance of the
> given class which represents our site. Thus, this can be represented
> as follows:
>
> class SiteFactory
>   def create( site )
>     site.new
>   end
> end

The above class can be refactored to the following:

    class SiteFactory
      def self.create(site)
        site.new
      end
    end

> Putting it all together, we can crawl all of our sites by doing the
> following:
>
> site_factory = SiteFactory.new
>
> [ HerSite, HisSite ].each do | klass |
>   site = site_factory.create( klass )
>   Spider.crawl_site( site )
> end

Now, we can rewrite our calling routine to the following:

    [HerSite, HisSite].each do |klass|
      site = SiteFactory.create(klass)
      Spider.crawl_site(site)
    end

Enjoy,

-Conrad

ps: There's always something you missed after you click send.
On 07/12/2011 08:42 AM, Conrad Taylor wrote:
> The above class can be refactored as to the following:
>
> class SiteFactory
>   def self.create( site )
>     site.new
>   end
> end

I'm just curious, what exactly is the point of this class?

> Now, we can rewrite our calling routine to the following:
>
> [ HerSite, HisSite ].each do | klass |
>   site = SiteFactory.create( klass )
>   Spider.crawl_site( site )
> end

Seems needlessly verbose. Why not just get rid of the factory that isn't doing anything and just do...

    [HerSite, HisSite].each do |klass|
      Spider.crawl_site(klass.new)
    end

In fact, why not just...

    Site.subclasses.each { |klass| Spider.crawl_site(klass.new) }

Forgive me, I'm a Smalltalker, but this whole explicit factory business and explicit arrays of classes just looks too Java'ish in an object system with metaclasses and reflection. Is there some reason you wouldn't just reflect the subclasses? Is there some reason for a factory that does nothing? Even if you need a factory, why wouldn't you just use class methods on Site?

-- 
Ramon Leon
http://onsmalltalk.com
Conrad Taylor
2011-Jul-13 03:01 UTC
Re: different code for each record, how to implement??
On Tue, Jul 12, 2011 at 9:46 AM, Ramon Leon <ramon.leon-fDeA0g24QwDby3iVrkZq2A@public.gmane.org> wrote:
> In fact, why not just...
>
> Site.subclasses.each { | klass | Spider.crawl_site(klass.new) }

Yes, the above is possible, but I can see where just getting all the subclasses of a class might not be what you want.

> Forgive me, I'm a Smalltalker, but this whole explicit factory business and
> explicit arrays of classes just looks too Java'ish in an object system with
> meta classes and reflection. Is there some reason you wouldn't just reflect
> the subclasses? Is there some reason for a factory that does nothing? Even
> if you need a factory, why wouldn't you just use class methods on Site?

Next, the Ruby language 1.9.2/1.9.3dev doesn't support a built-in method called subclasses like Smalltalk. Thus, one could implement a subclasses method in the Ruby language as follows:

    class Class
      # Select all the classes that are derived from self (i.e. Site).
      # Note that this returns every descendant, not only direct subclasses.
      def subclasses
        ObjectSpace.each_object(Class).select { |klass| klass < self }
      end
    end

This requires opening the class called Class and defining a method called subclasses. Furthermore, one can use a built-in Ruby hook method called inherited to arrive at the same result. For example,

    class Site
      @subclasses = []

      class << self
        attr_reader :subclasses
      end

      # Called by Ruby each time a class inherits from Site.
      def self.inherited(klass)
        @subclasses << klass
      end

      # to_s should return a string rather than print one.
      def to_s
        "using #{self.class}#to_s"
      end

      def crawl
        puts "using #{self.class}#crawl version 0"
      end
    end

Ramon, you're correct in saying that the SiteFactory class could be removed for a much more concise solution.

-Conrad
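To show how the inherited hook ties back to the crawl loop, here is a minimal self-contained sketch under a few assumptions. The list is always stored on Site itself so that grandchildren register too (with the version above, `@subclasses` would be nil inside a subclass of a subclass), the accessor is named `registry` (a made-up name, chosen to avoid clashing with any existing `subclasses` method), and crawl returns a string instead of printing. HerSite and HisSite are the placeholder names used earlier in the thread. Ruby 3.1 and later ship `Class#subclasses` in core, and ActiveSupport also provides one, so inside a Rails app the hand-written registry may be unnecessary.

```ruby
class Site
  @registry = []

  class << self
    # Always read the list stored on Site, even when called on a subclass.
    def registry
      Site.instance_variable_get(:@registry)
    end

    # Ruby calls this hook each time a class inherits from Site or from
    # one of its descendants.
    def inherited(klass)
      super
      Site.registry << klass
    end
  end

  def crawl
    "using #{self.class}#crawl version 0"
  end
end

class HerSite < Site
  def crawl
    "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    "using #{self.class}#crawl version 2"
  end
end

class Spider
  def self.crawl_site(site)
    site.crawl
  end
end

# No explicit array of classes: every class registers itself when defined.
Site.registry.each { |klass| puts Spider.crawl_site(klass.new) }
```

One caveat with any registry of this kind: a class only appears in it once its file has been loaded, so lazily loaded site classes would need to be required before the loop runs.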
Thank you all so much.

I did it like you said, with a set of models not based on ActiveRecord.

Best regards,

Cristóbal