Peter Szinek
2007-Mar-26  08:03 UTC
[ANN] scRUBYt! - Hpricot and WWW::Mechanize on even more steroids, 0.2.6 released
Hello all,
scRUBYt! version 0.2.6 has been released with some great new features,
tons of bugfixes and a lot of changes overall which should greatly
improve the reliability of the system.
============
What's this?
============
scRUBYt! is a very easy to learn and use, yet powerful Web scraping
framework based on Hpricot and mechanize. Its purpose is to free you
from the drudgery of web page crawling, looking up HTML tags,
attributes, XPaths, form names and other typical low-level web scraping
woes by figuring these out from your examples copy'n'pasted from the
Web page.
===========
What's new?
===========
A lot of long-awaited features have been added: most notably, automatic
crawling to the detail pages, which was the most requested feature in
scRUBYt!'s history.
Another great addition is the improved example generation - you don't
have to use the whole text of the element you would like to match
anymore - it is enough to specify a substring, and the first element
that contains the string will be returned. Moreover, it is possible to
create compound examples like this:
  flight :begins_with => 'Arrival', :contains => /\d{4}/,
         :ends_with => '20:00'
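
To give a feel for what such a compound example means, here is a minimal
sketch in plain Ruby. It is not scRUBYt!'s actual code: the helper
matches_compound_example? and the candidate strings are made up for
illustration, and it only assumes that every constraint in the hash has to
hold for the same piece of text.

```ruby
# Hypothetical helper, NOT part of scRUBYt!'s API: checks whether a piece
# of text satisfies every constraint of a compound example hash.
def matches_compound_example?(text, example)
  example.all? do |constraint, expected|
    case constraint
    when :begins_with then text.start_with?(expected)
    when :ends_with   then text.end_with?(expected)
    when :contains
      # :contains may be a plain substring or a regexp
      expected.is_a?(Regexp) ? !(text =~ expected).nil? : text.include?(expected)
    else
      raise ArgumentError, "unknown constraint: #{constraint.inspect}"
    end
  end
end

example = { :begins_with => 'Arrival', :contains => /\d{4}/, :ends_with => '20:00' }

# Made-up node texts standing in for elements of a scraped page.
candidates = [
  'Departure 2007 at 20:00',
  'Arrival LH 1234 at 20:00',
  'Arrival at 20:00'
]

# As with substring examples, the first text that matches wins.
first_match = candidates.find { |text| matches_compound_example?(text, example) }
puts first_match
```

Only the second candidate satisfies all three constraints at once: the
first does not begin with 'Arrival', and the third contains no run of four
digits.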
Crawling through next links has been greatly improved - it is
possible to use images as next links, to generate URLs instead of
clicking on the next link, and a great many bugs (including the
infamous google next link problem) have been fixed.
An enormous number of bugs were fixed and the whole system was tested
thoroughly, so the overall reliability should be improved a lot
compared to the previous releases.
Something non-software related: 4 people have joined the development, so 
I guess there is much, much more to come in the future!
=========
CHANGELOG
=========
* [NEW] Automatically crawling to and extracting from detail pages
* [NEW] Compound example specification: So far the example of a pattern
    had to be a string. Now it can be a hash as well, like
    {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
* [NEW] More sophisticated example specification: It is possible to use
    a regexp as well, and you no longer need to specify the whole
    content of the node (though this is still possible, of course) -
    nodes that contain the string/match the regexp will be returned, too
* [NEW] Possibility to force writing text in case of non-leaf nodes
* [NEW] Crawling to the next page now possible via image links as well
* [NEW] Possibility to define examples for any pattern (before it did
    not make sense for ancestors)
* [NEW] Implementation of crawling to the next page with different
    methods
* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
    some_url 'href', :type => :attribute
* [FIX] Crawling to the next page (the broken google example): if the
    element matching the next link text is not an <a>, traverse down
    until the <a> is found; if it is still not found, traverse up until
    it is found
* [FIX] Crawling to next pages does not break if the next link is greyed
    out (or is otherwise present but has no href attribute) (Credit:
    Robert Au)
* [FIX] DRY-ed next link lookup - it should be much more robust now as
    it uses the 'standard' example lookup
* [NEW] Correct exporting of detail page extractors
* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
* [NEW] New examples for the new features
* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and
    stabilization
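
The next link fix from the changelog above (go down from the matched
element until an <a> is found, otherwise go up) can be sketched roughly
like this. This is an illustration in plain Ruby, not scRUBYt!'s code:
the tiny Node class and the helper names are made up, while scRUBYt!
itself works on Hpricot elements.

```ruby
# Hypothetical minimal DOM node, just enough to show the traversal.
class Node
  attr_reader :name, :attrs, :children
  attr_accessor :parent

  def initialize(name, attrs = {})
    @name = name
    @attrs = attrs
    @children = []
    @parent = nil
  end

  # Appends a child, wires up its parent and returns the child.
  def add(child)
    child.parent = self
    @children << child
    child
  end
end

# Depth-first search for the first <a> at or below the given node.
def find_link_below(node)
  return node if node.name == 'a'
  node.children.each do |child|
    found = find_link_below(child)
    return found if found
  end
  nil
end

# Walks up the ancestors until an <a> is found (nil if there is none).
def find_link_above(node)
  node = node.parent while node && node.name != 'a'
  node
end

# Try downwards first; if that fails, try upwards.
def next_link_for(node)
  find_link_below(node) || find_link_above(node)
end

# nil means "nothing to follow", e.g. a greyed out next link without href.
def next_href(node)
  link = next_link_for(node)
  link && link.attrs['href']
end

# <a href="/page2"><img></a>: the example matched the image, so go up.
a = Node.new('a', 'href' => '/page2')
img = a.add(Node.new('img'))

# <span><b><a href="/page3"></a></b></span>: matched the span, so go down.
span = Node.new('span')
span.add(Node.new('b')).add(Node.new('a', 'href' => '/page3'))

# <span></span> with no link around it: the last page, nothing to follow.
greyed = Node.new('span')

puts next_href(img).inspect
puts next_href(span).inspect
puts next_href(greyed).inspect
```

Returning nil instead of raising is what lets a crawler stop cleanly on
the last page, where the next link is present but not clickable.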
============
Announcement
============
By popular demand, there is a new forum to discuss everything scRUBYt!
related:
http://agora.scrubyt.org
You are welcome to sign up to share your opinion, ask for features,
report bugs or discuss stuff - or just to look around and see what
others are saying.
================
Closing thoughts
================
Please keep the feedback coming - your contributions are a key factor in
scRUBYt!'s success. This is not an exaggeration or a feeble attempt at
flattery - since we (obviously) cannot test everything on every
possible page, we can make scRUBYt! truly powerful only if you send us
all the quirks and problems you encounter during scraping, as well as
your suggestions and ideas. Thanks everyone!
Cheers,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.