Hi all,
this is quite off-topic, but I'm sure a lot of people here have experience
in this area, so...
I'm writing a website scraper script that needs to download a web page,
traverse the (X)HTML tree and finally insert data and HTML pieces into
a DB. Eventually this data will be served up as RSS and/or Atom.
I'm currently using html/tree (htmltools); I've also tried Rubyful Soup;
both have their own shortcomings. What do you people suggest?
Regarding htmltools: I had to tweak it quite a bit, as it wouldn't recognize
XHTML-style "empty" tags (for instance, it dislikes <link ... />).
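For illustration, one way to sidestep the empty-tag problem without patching the parser would be to preprocess the input. This is only a minimal sketch: the helper name strip_self_closing and the VOID_TAGS list are made up here (they are not part of htmltools), and the regexp assumes attribute values never contain "<" or ">".

```ruby
# Hypothetical helper, not part of htmltools: rewrite XHTML-style empty
# tags such as <img src='blah' /> as plain HTML start tags (<img src='blah'>)
# so that a parser which only knows HTML syntax can digest them.
# Assumption: this list of empty elements is enough for the pages at hand.
VOID_TAGS = %w[area base br col hr img input link meta param].freeze

def strip_self_closing(html)
  # $1 is the tag name, $2 the attribute text (possibly empty); the
  # optional whitespace and the trailing "/" before ">" are dropped.
  pattern = %r{<(#{VOID_TAGS.join('|')})\b([^<>]*?)\s*/>}i
  html.gsub(pattern) { "<#{$1}#{$2}>" }
end

puts strip_self_closing("<a href='about:blank'><img src='blah' /></a>")
# prints: <a href='about:blank'><img src='blah'></a>
```

The cleaned-up string could then be passed to the parser's feed method in place of the raw page. Being regexp-based, it is a stopgap rather than a real fix.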
What's even worse, I can't seem to get it to dump back the HTML it read.
Something as simple as:
#!/usr/bin/env ruby
require 'html/tree'
p = HTMLTree::Parser.new(false, false)
p.feed("<a href='about:blank'><img src='blah' /></a>")
p.tree.dump
Results in:
<a href="about:blank">
<img src="blah">
Rubyful Soup is not perfect either, quite often spewing things like
<img img="" ...; OTOH, it groks XHTML. But it's much, much slower...
What do you think? Any pointers, suggestions, etc. are very, very welcome!
Bye,
Andrea
--
Press every key to continue.