I am having a complicated issue here. I am trying to fetch a page from Froogle and parse it with Hpricot to collect data about the products in the search results.

Sample page: http://froogle.google.com/froogle?q=magnets&btnG=Search

The problem is that the HTML on Froogle is seriously broken. I need to get the table row (tr) for each product, and then look in each of that row's td's for data. But Google's HTML is full of unclosed table tags, which makes Hpricot freak out. Hpricot thinks the tr's are empty:

  "<tr valign=\"top\">\n</tr>"

So I guess the question is: how do I make Hpricot cope with this markup? It obviously works great in the browser. Are there any tools that will convert a string of HTML to a valid XML or DOM equivalent? It must be possible, because web browsers handle it all the time.

What I need to be able to do:

  html = open('http://foo.com/').read
  html = html.clean_markup
  html = Hpricot(html)

---

Here is an oversimplified example of Froogle's malformed markup:

  <table>
  <tr>
  <td>foo
  <td>bar
  <tr>
  <td>baz
  <td>boo
  </table>

-- 
Posted via http://www.ruby-forum.com/.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en
-~----------~----~----~----~------~----~------~--~---

Alex Wayne wrote:
> I am having a complicated issue here. I am trying to fetch a page from
> Froogle and parse it via Hpricot to collect data from the products in
> the search results.
>
> sample page: http://froogle.google.com/froogle?q=magnets&btnG=Search
>
> The problem is that the HTML on Froogle is seriously broken. I need to
> get the table row (tr) for each product, and then look in each of that
> rows td's for data. But google's html is full of unclosed tags for
> their tables that makes Hpricot freak out. Hpricot thinks the tr's are
> empty:
>
> "<tr valign=\"top\">\n</tr>"

Here's a better illustration of the problem, from irb:

  pp Hpricot('<table><tr><td>foo<td>bar</table>')
  # => #<Hpricot::Doc {elem <table> {emptyelem <tr>} {elem <td> {text "foo"}} {elem <td> {text "bar"}} </table>}>

The <tr> is empty, and the <td>'s are considered direct children of <table>, so the selector "table tr td" won't work. There is no way to group td's by row in this case.

On 17 Nov 2006, at 00:06, Alex Wayne wrote:
> The problem is that the HTML on Froogle is seriously broken.

Agreed!

> So I guess the question is how do I make Hpricot cope with this markup?
> It obviously works great in the browser. Are there any tools that will
> convert a string of html to a valid XML or DOM equivalent? It must be
> possible because web browsers handle it all the time.
>
> What I need to be able to do:
>
> html = open('http://foo.com/').read
> html = html.clean_markup
> html = Hpricot(html)

I had a similar problem last week and ended up doing exactly what you are proposing, i.e. a pre-processing step to clean up the HTML before feeding it to Hpricot.

> Here is an oversimplified example of froogle's of malformed markup:
>
> <table>
> <tr>
> <td>foo
> <td>bar
> <tr>
> <td>baz
> <td>boo
> </table>

I believe there are Ruby libraries for cleaning up HTML, though I'm not familiar with them. Perhaps you could just treat it as a long string and walk over it doing the following:

1. Scan forward until you find a tag (either opening or closing).

2. If the tag is a known potentially-broken one ('<tr>', '<th>', '<td>', etc.), set a flag for that tag to indicate it is open (or push it onto a per-tag stack somewhere). Clear the flag (or pop the stack) if/when you see the matching closing tag.

3. When you see that tag again, if it hasn't been closed in the meantime, insert the closing tag yourself and clear your flag (pop your stack).

I think it will be easier to do than it sounds ;-)

Hope that helps,
Andy

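The three steps above could be sketched roughly as follows. This is only an illustration, not code from anyone in the thread: the method name close_open_tags and the IMPLIED_CLOSES table are made up for the example, it uses a single stack rather than per-tag flags, and it goes slightly beyond the steps by also closing open cells when a new <tr> starts or when </table> is reached. It assumes no nested tables and no '>' inside attribute values.

```ruby
# Sketch only: insert the </td>, </th> and </tr> end tags that the markup
# leaves out, so a strict parser sees balanced table rows and cells.
# Assumptions: no nested tables, no '>' inside attribute values.

TRACKED = %w[tr td th].freeze

# For each tag we meet, the open elements it implicitly closes.
IMPLIED_CLOSES = {
  'td'     => %w[td th],
  'th'     => %w[td th],
  'tr'     => %w[td th tr],
  '/tr'    => %w[td th],
  '/table' => %w[td th tr]
}.freeze

def close_open_tags(html)
  open_tags = [] # stack of tr/td/th elements that are still open

  # Step 1: scan forward from tag to tag (comments and doctypes do not match).
  html.gsub(%r{<(/?)(\w+)[^>]*>}) do
    tag     = Regexp.last_match(0)
    closing = Regexp.last_match(1) == '/'
    name    = Regexp.last_match(2).downcase
    key     = closing ? "/#{name}" : name

    # Step 3: emit the end tags this tag implies for anything still open.
    inserted = ''
    implied  = IMPLIED_CLOSES.fetch(key, [])
    while !open_tags.empty? && implied.include?(open_tags.last)
      inserted += "</#{open_tags.pop}>"
    end

    # Step 2: remember newly opened tags, forget explicitly closed ones.
    if TRACKED.include?(name)
      if closing
        open_tags.pop if open_tags.last == name
      else
        open_tags.push(name)
      end
    end

    inserted + tag
  end
end
```

Markup that already has all of its end tags passes through unchanged, so the result can be handed to Hpricot(...) in either case.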
Andrew Stewart wrote:
> On 17 Nov 2006, at 00:06, Alex Wayne wrote:
> > The problem is that the HTML on Froogle is seriously broken.
>
> Agreed!

Disagree!

The example given is not malformed. It's perfectly acceptable HTML 4.01. The end tags for <tr> and <td> can be omitted.

Unless the DOCTYPE declaration claims it to be something newer than HTML 4.01, it is fine.

I would say this is a bug in Hpricot.

- Mark.

Take a look at scrapi - if not to actually use then to steal Assaf's ideas. =) I THINK he has some sort of way to pre-process HTML with Tidy in there; might want to crib those ideas.

-- 
I think it is inevitable that people program poorly. Training will not substantially help matters. We have to learn to live with it. -- Alan Perlis

On Nov 17, 2006, at 9:51 AM, Thomas, Mark - BLS CTR wrote:
> Andrew Stewart wrote:
>> On 17 Nov 2006, at 00:06, Alex Wayne wrote:
>>> The problem is that the HTML on Froogle is seriously broken.
>>
>> Agreed!
>
> Disagree!
>
> The example given is not malformed. It's perfectly acceptable HTML 4.01.
> The end tags for <tr> and <td> can be omitted.
>
> Unless the DTD declaration claims it to be something newer than HTML
> 4.01, it is fine.
>
> I would say this is a bug in Hpricot.
>
> - Mark.

You can use RubyfulSoup to deal with HTML even when it isn't completely correct. It is packaged as a gem, but I unpacked it into the plugin directory and it's working for me. (Hpricot didn't exist at the time or I might have tried it.)

  #Rubyful Soup
  #Elixir and Tonic
  #"The Screen-Scraper's Friend"
  #v1.0.4
  #http://www.crummy.com/software/RubyfulSoup/
  #
  #Rubyful Soup is a port to the Ruby language and idiom of the Python
  #library Beautiful Soup.
  #See http://www.crummy.com/software/BeautifulSoup/ for details on the original.

-Rob

Rob Biedenharn    http://agileconsultingllc.com
Rob-xa9cJyRlE0mWcWVYNo9pwxS2lgjeYSpx@public.gmane.org

On 11/17/06, Michael Campbell <michael.campbell-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Take a look at scrapi - if not to actually use then to steal Assaf's
> ideas. =) I THINK he has some sort of way to pre-process HTML with
> Tidy in there; might want to crib those ideas.

We also use tidy for cleaning up invalid XHTML in the MasterView project.

You can get the Ruby tidy wrapper here:
http://rubyforge.org/projects/tidy
http://tidy.rubyforge.org/ (for usage info)

Note that it also requires the tidy library to be available on the server. It is available for both Windows and *nix.

It works well at cleaning up invalid XHTML, and the Ruby tidy wrapper is simple to use. The only disadvantage is that you need to have the lib available and you need to set the path to the lib so that the wrapper can load it. I wish that could be automated somehow, because it is a manual setup step.

Jeff

Jeff Barczewski wrote:
> On 11/17/06, Michael Campbell <michael.campbell-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Take a look at scrapi - if not to actually use then to steal Assaf's
>> ideas. =) I THINK he has some sort of way to pre-process HTML with
>> Tidy in there; might want to crib those ideas.
>
> We also use tidy for cleaning up invalid xhtml with MasterView project.
>
> You can get the ruby tidy wrapper here
> http://rubyforge.org/projects/tidy
> http://tidy.rubyforge.org/ (for usage info)
>
> Jeff

I seem to be having some luck with tidy, cleaning the markup before I send it to Hpricot. This little code snippet seems to take care of keeping Tidy.path assigned. I just have to include the Linux and Windows tidy libs in my /lib directory.

  require 'tidy'

  # Point the tidy wrapper at the library bundled for the current platform.
  if RUBY_PLATFORM =~ /mswin/
    Tidy.path = "#{RAILS_ROOT}/lib/tidy.dll"
  else
    Tidy.path = "#{RAILS_ROOT}/lib/tidy"
  end

Thanks for the tip!

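If the platform check in the snippet above needs to be tried out without the tidy wrapper loaded, it could be pulled into a small helper along these lines. This is a sketch under assumptions: tidy_library_path is a made-up name, and matching "mingw" as well as "mswin" is an addition for other Windows builds of Ruby that the snippet above does not make.

```ruby
# Hypothetical helper: build the path to the bundled Tidy library.
# root     - application root directory (RAILS_ROOT in the snippet above)
# platform - platform string; defaults to the running interpreter's
def tidy_library_path(root, platform = RUBY_PLATFORM)
  # Assumption: treat both mswin and mingw builds of Ruby as Windows.
  file = platform =~ /mswin|mingw/ ? 'tidy.dll' : 'tidy'
  File.join(root, 'lib', file)
end
```

With the wrapper loaded, the assignment would then be a single line: Tidy.path = tidy_library_path(RAILS_ROOT).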