All,

If anyone is thinking about using either of these packages to screen-scrape, I think you should consider Mechanize as an option over RubyfulSoup.

I was using RubyfulSoup to scrape HTML pages in a batch process where performance didn't matter very much. I then needed to port the functionality into a user process, where performance did become an issue. RubyfulSoup was taking about 30 seconds to initialize and load the page before any processing was done on it. This was unacceptable for the user process.

I started looking into other options. SCRAPI seemed really promising, but I couldn't find enough documentation on it to make much headway. It may be a good option for others who are more familiar with CSS selectors, but that person isn't me.

I then looked into WWW::Mechanize. Most of what I found on the internet was about using it to fill out forms and post data. It was hard to find good examples of parsing out text values, etc., but it turned out to be a great option. WWW::Mechanize uses hpricot for querying the HTML document with XPath or CSS selectors.

In my opinion, RubyfulSoup is much easier to learn and use initially. However, WWW::Mechanize is MUCH faster, at least for my needs. The page that was taking over 30 seconds to load into RubyfulSoup takes just a few seconds to load into Mechanize (and that is the time it takes to pull it down from the source URL). Parsing, searching and extracting are extremely fast, which solved my performance problems. I already knew XPath query statements, so it was pretty easy.

Hopefully someone else can benefit from this before investing a lot of time in RubyfulSoup just to find that it may have performance issues.

Regards,
Michael

--
Posted via http://www.ruby-forum.com/.
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org To unsubscribe from this group, send email to rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en -~----------~----~----~----~------~----~------~--~---
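The post above separates two costs: pulling the page down and loading/parsing it. A minimal sketch of how those two steps could be timed separately follows. It uses only core Ruby; the `fetch` and `parse` lambdas, the canned HTML string and the regex are hypothetical stand-ins for a real HTTP request and a real parser, not anything from RubyfulSoup or Mechanize.

```ruby
# Run a block and return [result, elapsed_seconds] using a monotonic clock.
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Hypothetical stand-ins: swap in the real download and the real parser.
fetch = -> { "<html><body><p class='price'>42</p></body></html>" }
parse = ->(html) { html.scan(/<p[^>]*>(.*?)<\/p>/m).flatten }

html, fetch_secs   = timed { fetch.call }
values, parse_secs = timed { parse.call(html) }

printf("fetch: %.4fs  parse: %.4fs\n", fetch_secs, parse_secs)
```

Measuring the two steps independently shows whether the time is going into the network or into the parser before any library is swapped out.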
Justin Forder
2006-Dec-03 01:26 UTC
Re: RubyfulSoup vs Mechanize - Surprising Performance...
Michael wrote:
> All,
>
> If anyone is thinking about using either of these packages to
> screen-scrape then I think you should consider mechanize as an option
> over rubyfulsoup.
>
> I was using rubyfulsoup to scrape html pages via a batch process where
> performance didn't matter too very much. I needed to port the
> functionality into a user process where performance did become an issue.
> RubyfulSoup was taking about 30 seconds to initialize/load the page
> prior to any processing being done on the page. This was unacceptable
> for the user process.
>
> I started looking into other options. SCRAPI was one option that seemed
> really promising but I couldn't find enough documentation on it to make
> much headway. It may be a good option for others who are more familiar
> with CSS Selectors, but that person isn't me.
>
> I then looked into WWW::Mechanize. Most of the reading I found on the
> internet was related to using this for filling out forms and posting
> data. It was hard to find good examples for parsing out text values,
> etc... but this turned out to be a great option. WWW::Mechanize uses
> hpricot for querying the html document with xpath or css selectors.
>
> In my opinion, RubyfulSoup is much easier to learn and use initially.
> However, WWW::Mechanize is MUCH faster - at least for my needs. The
> page that was taking over 30 seconds to load into rubyfulsoup takes just
> a few seconds to load into mechanize (and this is the amount of time it
> takes to pull it down from the source url).
> Parsing/searching/extracting is extremely fast and solved my performance
> problems. I already knew xpath query statements so it was pretty easy.
>
> Hopefully someone else can benefit from this before investing a lot of
> time in rubyfulsoup just to find that it may have performance issues.

I was using regular expressions for some page-scraping, then found out about RubyfulSoup. It seemed like the "proper" way to do things, but I had to abandon it because, for my application, it was intolerably slow. I have to deal with hundreds or thousands of pages, and if the parsing takes much longer than the fetching (over a 0.5Mbit/s connection) that's no good for me.

regards

Justin Forder
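The "parsing must not take much longer than fetching" rule can be turned into a rough per-page time budget. The sketch below works that arithmetic out for the 0.5Mbit/s link mentioned in the post. The page size (50,000 bytes) and page count (1,000) are assumed figures chosen for illustration, not numbers from the thread, and protocol overhead and latency are ignored.

```ruby
# Seconds needed to transfer page_bytes over a link of link_bits_per_sec.
def fetch_seconds(page_bytes, link_bits_per_sec)
  (page_bytes * 8).to_f / link_bits_per_sec
end

link_bits_per_sec = 500_000   # 0.5 Mbit/s, as stated in the post
page_bytes        = 50_000    # assumed average page size
page_count        = 1_000     # assumed number of pages

per_page = fetch_seconds(page_bytes, link_bits_per_sec)  # 0.8 seconds
total    = per_page * page_count                         # 800 seconds

puts "fetch budget per page: #{per_page}s, for #{page_count} pages: #{total}s"
```

Under these assumptions a parser that needs much more than about 0.8 seconds per page becomes the bottleneck rather than the connection.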
Michael wrote:
> If anyone is thinking about using either of these packages to
> screen-scrape then I think you should consider mechanize as an option
> over rubyfulsoup.

At a guess, I would use...

  wget to pull down the page
  tidy to convert it to XHTML
  XPath from libxml or similar high-end parser

All three engines are written in a C language, not our beloved Ruby. And no Perl, either...

--
  Phlip
  http://www.greencheese.us/ZeekLand <-- NOT a blog!!!
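Once a page has been converted to well-formed XHTML, the last step of that pipeline is a plain XPath query. A minimal sketch of that step follows, assuming the download and tidy steps have already run. It uses REXML from Ruby's standard distribution as a stand-in for libxml, so it shows the shape of the query rather than the speed of a C parser. The sample markup and the class names `name` and `price` are invented for illustration.

```ruby
require 'rexml/document'

# Assumed sample input: well-formed XHTML, as tidy would produce.
xhtml = <<~XHTML
  <html>
    <body>
      <table id="prices">
        <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
        <tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
      </table>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# Select cells by attribute and pull out their text content.
names  = REXML::XPath.match(doc, "//td[@class='name']").map(&:text)
prices = REXML::XPath.match(doc, "//td[@class='price']").map(&:text)

names.zip(prices).each { |name, price| puts "#{name}: #{price}" }
```

REXML only accepts well-formed XML, which is why the tidy step matters here; raw HTML from the wild would usually fail to parse.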
Michael,

You may spend a little time evaluating hpricot on your data:
http://code.whytheluckystiff.net/hpricot/

It's easy to learn and faster than rubyfulsoup (from the benchmarks I found through Google).

Alain
Vishnu Gopal
2006-Dec-03 14:43 UTC
Re: RubyfulSoup vs Mechanize - Surprising Performance...
Yup, hpricot rocks.

Vish

On 12/3/06, Alain Ravet <alain.ravet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Michael,
>
> You may spend a little time evaluating hpricot on your data:
> http://code.whytheluckystiff.net/hpricot/
>
> It's easy to learn and faster than rubyfulsoup (from the benchmarks I
> found through Google).
>
> Alain
Alain Ravet wrote:
> Michael,
>
> You may spend a little time evaluating hpricot on your data:
> http://code.whytheluckystiff.net/hpricot/
>
> It's easy to learn and faster than rubyfulsoup (from the benchmarks I
> found through Google).
>
> Alain

Alain,

I guess you didn't read my post closely enough!! I found that Mechanize is way faster than RubyfulSoup, and I stated that Mechanize uses hpricot for parsing! ;-)

So... I have already spent a little time evaluating it, and that was the purpose of my post: to save others from going down the slower path.

Thanks,
Michael
Justin Forder wrote:
> Michael wrote:
>> prior to any processing being done on the page. This was unacceptable
>> etc... but this turned out to be a great option. WWW::Mechanize uses
>> Hopefully someone else can benefit from this before investing a lot of
>> time in rubyfulsoup just to find that it may have performance issues.
>
> I was using regular expressions for some page-scraping, then found out
> about RubyfulSoup. It seemed like the "proper" way to do things, but I
> had to abandon it because, for my application, it was intolerably slow.
> I have to deal with hundreds or thousands of pages, and if the parsing
> takes much longer than the fetching (over a 0.5Mbit/s connection) that's
> no good for me.
>
> regards
>
> Justin Forder

Justin,

The parsing with mechanize is extremely fast!

Michael
Justin Forder
2006-Dec-03 19:58 UTC
Re: RubyfulSoup vs Mechanize - Surprising Performance...
Michael wrote:
> Justin Forder wrote:
>> Michael wrote:
>>> prior to any processing being done on the page. This was unacceptable
>>> etc... but this turned out to be a great option. WWW::Mechanize uses
>>> Hopefully someone else can benefit from this before investing a lot of
>>> time in rubyfulsoup just to find that it may have performance issues.
>>
>> I was using regular expressions for some page-scraping, then found out
>> about RubyfulSoup. It seemed like the "proper" way to do things, but I
>> had to abandon it because, for my application, it was intolerably slow.
>> I have to deal with hundreds or thousands of pages, and if the parsing
>> takes much longer than the fetching (over a 0.5Mbit/s connection) that's
>> no good for me.
>>
>> regards
>>
>> Justin Forder
>
> Justin,
>
> The parsing with mechanize is extremely fast!
>
> Michael

Thanks, I'll take a look.

Justin
Victor Rosillo
2006-Dec-03 23:03 UTC
Re: RubyfulSoup vs Mechanize - Surprising Performance...
Is there a Ruby solution for spidering and scraping pages that are built with JavaScript, for example where a form and its options are generated by JavaScript? I have a job where I have to spider and scrape JavaScript-built pages, and I would like to do it with a Ruby solution. Any suggestions?
brabuhr-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
2006-Dec-04 02:03 UTC
Re: RubyfulSoup vs Mechanize - Surprising Performance...
On 12/3/06, Victor Rosillo <victorrosillo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Is there a ruby solution to spider and scrape javascript formed pages,
> like when a form and it's options are made with javascript; I have a
> job where I have to spider and scrape javascript built pages, wish I
> can do it via ruby solution. Any suggestions?

I've never used it, but I've seen a Ruby extension:
http://raa.ruby-lang.org/project/ruby-js/