thr3ads.net - Rails - [Rails] OT: Scraper library recommendation [Jan 2006]

Andrea Campi

2006-Jan-10 17:58 UTC

[Rails] OT: Scraper library recommendation

Hi all,

this is quite off-topic, but I'm sure a lot of people here have
experience in the area, so...

I'm writing a website scraper script that needs to download a web page,
traverse the (X)HTML tree and finally insert data and HTML pieces into
a DB. Eventually this data will be served up as RSS and/or Atom.
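
To make the traversal step concrete, here is a minimal sketch. It assumes the page is already downloaded and is well-formed XHTML, so REXML (shipped with the standard Ruby distribution) can parse it. The sample markup, the `extract_links` helper and the row layout are all hypothetical, for illustration only; real-world tag soup would need a more forgiving parser.

```ruby
require 'rexml/document'

# Hypothetical sample input; a real scraper would fetch this over HTTP.
XHTML = <<~PAGE
  <html>
    <head><link rel="stylesheet" href="style.css" /></head>
    <body>
      <div class="item"><a href="about:blank"><img src="blah" /></a></div>
      <div class="item"><a href="http://example.com/">Example</a></div>
    </body>
  </html>
PAGE

# Walk the tree and return one hash per link: the href plus the
# serialized markup inside the link, ready to be stored in a DB row.
def extract_links(xhtml)
  doc = REXML::Document.new(xhtml)
  rows = []
  REXML::XPath.each(doc, '//a') do |a|
    rows << {
      href: a.attributes['href'],
      html: a.children.map(&:to_s).join
    }
  end
  rows
end

extract_links(XHTML).each { |row| puts "#{row[:href]} => #{row[:html]}" }
```

Each row keeps both the extracted value and the HTML fragment, which matches the "data and HTML pieces" that would go into the DB.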

I'm currently using html/tree (htmltools); I've also tried Rubyful Soup;
both have their own shortcomings. What do you people suggest?

Regarding htmltools: I had to tweak it quite a bit, as it wouldn't
recognize XHTML-style "empty" tags (for instance, it dislikes <link ... />).
What's even worse, I can't seem to get it to dump back the HTML it read.
Something as simple as:

#!/usr/bin/env ruby

require 'html/tree'

p = HTMLTree::Parser.new(false, false)
p.feed("<a href='about:blank'><img src='blah' /></a>")
p.tree.dump

Results in:

  <a href="about:blank">
    <img src="blah">
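
One possible workaround for the XHTML empty-tag problem, short of patching the parser, is to rewrite `<tag ... />` as `<tag ...>` before feeding the markup in. The sketch below is a regex-based approximation, not a real parser: `strip_self_closing` is a hypothetical helper, it assumes no attribute value contains the sequence `/>`, and whether htmltools then handles the result is not verified here.

```ruby
# Rewrite XHTML-style empty tags (<img ... />) as plain HTML tags (<img ...>).
# Assumption: no attribute value contains "/>", since this is only a regex
# pass over the text and knows nothing about quoting.
def strip_self_closing(html)
  html.gsub(%r{<(\w+)([^<>]*?)\s*/>}, '<\1\2>')
end

puts strip_self_closing("<a href='about:blank'><img src='blah' /></a>")
# prints: <a href='about:blank'><img src='blah'></a>
```

Tags that are not self-closing pass through unchanged, so the helper can be applied to the whole page before parsing.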


Rubyful Soup is not perfect either: it quite often spews things like
<img img="" ...; OTOH, it groks XHTML. But it's much, much slower...


What do you think? Any pointers, suggestions, etc. are very welcome!

Bye,
	Andrea

-- 
                   Press every key to continue.

Kevin Olbrich

2006-Jan-11 00:40 UTC

[Rails] Re: OT: Scraper library recommendation

On a related topic....

I've been thinking about writing a script that would scrape RDoc HTML
files and then insert descriptions from the code into a table.

The specific reason for this was to provide automagic population of the
privilege description fields in the 'user_engine'.

I suspect there may be other applications for this as well.
A good HTML scraper library would really help out with this.


_Kevin

-- 
Posted via http://www.ruby-forum.com/.
