thr3ads.net - Rails - Hpricot help - parsing malformed HTML [Nov 2006]


Alex Wayne

2006-Nov-17 00:06 UTC

Hpricot help - parsing malformed HTML

I am having a complicated issue here.  I am trying to fetch a page from 
Froogle and parse it via Hpricot to collect data from the products in 
the search results.

sample page: http://froogle.google.com/froogle?q=magnets&btnG=Search

The problem is that the HTML on Froogle is seriously broken.  I need to
get the table row (tr) for each product, and then look in each of that
row's td's for data.  But Google's HTML is full of unclosed tags for
their tables, which makes Hpricot freak out.  Hpricot thinks the tr's are
empty:

  "<tr valign=\"top\">\n</tr>"

So I guess the question is how do I make Hpricot cope with this markup? 
It obviously works great in the browser.  Are there any tools that will 
convert a string of html to a valid XML or DOM equivalent?  It must be 
possible because web browsers handle it all the time.

What I need to be able to do:

  html = open('http://foo.com/').read
  html = html.clean_markup
  html = Hpricot(html)

---

Here is an oversimplified example of Froogle's malformed markup:

  <table>
  <tr>
  <td>foo
  <td>bar
  <tr>
  <td>baz
  <td>boo
  </table>

-- 
Posted via http://www.ruby-forum.com/.

--~--~---------~--~----~------------~-------~--~----~
 You received this message because you are subscribed to the Google Groups
"Ruby on Rails: Talk" group.
To post to this group, send email to
rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to
rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at
http://groups.google.com/group/rubyonrails-talk?hl=en
-~----------~----~----~----~------~----~------~--~---

Alex Wayne

2006-Nov-17 00:12 UTC


Re: Hpricot help - parsing malformed HTML

Alex Wayne wrote:
> I am having a complicated issue here.  I am trying to fetch a page from
> Froogle and parse it via Hpricot to collect data from the products in
> the search results.
>
> sample page: http://froogle.google.com/froogle?q=magnets&btnG=Search
>
> The problem is that the HTML on Froogle is seriously broken.  I need to
> get the table row (tr) for each product, and then look in each of that
> row's td's for data.  But Google's HTML is full of unclosed tags for
> their tables, which makes Hpricot freak out.  Hpricot thinks the tr's are
> empty:
>
>   "<tr valign=\"top\">\n</tr>"
>

Here's a better illustration of the problem, from irb:

pp Hpricot('<table><tr><td>foo<td>bar</table>')
# => #<Hpricot::Doc
 {elem
  <table>
  {emptyelem <tr>}
  {elem <td> {text "foo"}}
  {elem <td> {text "bar"}}
  </table>}>

The <tr> is empty, and the <td>'s are considered direct children of
<table>.  So the selector "table tr td" won't work.  There is no way to
group td's by row in this case.
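
That said, if the empty <tr> elements keep their position among the table's children, as the pp output above suggests, they could still serve as row delimiters. The following is a minimal sketch of that idea in plain Ruby, not tested against Hpricot itself. Node is a hypothetical stand-in for a parsed element (it is not Hpricot's class), and group_cells_by_row is an illustrative name.

```ruby
# Hypothetical stand-in for a parsed node; NOT Hpricot's own class.
Node = Struct.new(:name, :text)

# Split a flat list of table children into rows, starting a new row
# at every <tr> marker and keeping only the cell texts.
def group_cells_by_row(children)
  children
    .slice_before { |node| node.name == 'tr' }
    .map { |group| group.select { |n| %w[td th].include?(n.name) }.map(&:text) }
    .reject(&:empty?)
end

# Flat structure equivalent to the two-row example in this thread.
children = [
  Node.new('tr', nil), Node.new('td', 'foo'), Node.new('td', 'bar'),
  Node.new('tr', nil), Node.new('td', 'baz'), Node.new('td', 'boo'),
]

p group_cells_by_row(children)  # => [["foo", "bar"], ["baz", "boo"]]
```

This only helps when the parser preserves sibling order; cleaning the markup before parsing is still the more robust route.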

-- 
Posted via http://www.ruby-forum.com/.


Andrew Stewart

2006-Nov-17 08:37 UTC


Re: Hpricot help - parsing malformed HTML

On 17 Nov 2006, at 00:06, Alex Wayne wrote:
> The problem is that the HTML on Froogle is seriously broken.
Agreed!

> So I guess the question is how do I make Hpricot cope with this markup?
> It obviously works great in the browser.  Are there any tools that will
> convert a string of html to a valid XML or DOM equivalent?  It must be
> possible because web browsers handle it all the time.
>
> What I need to be able to do:
>
>   html = open('http://foo.com/').read
>   html = html.clean_markup
>   html = Hpricot(html)
I had a similar problem last week and ended up doing exactly what you  
are proposing, i.e. a pre-processing step to clean up the HTML before  
feeding it to Hpricot.

> Here is an oversimplified example of Froogle's malformed markup:
>
>   <table>
>   <tr>
>   <td>foo
>   <td>bar
>   <tr>
>   <td>baz
>   <td>boo
>   </table>
I believe there are Ruby libraries for cleaning up HTML though I'm
not familiar with them.  Perhaps you could just treat it as a long
string and walk over it doing the following:

1.  Scan forward until you find a tag (either opening or closing).
2.  If the tag is a known potentially-broken one ('<tr>', '<th>',
'<td>', etc.), set a flag for that tag to indicate it is open (or push
it onto a per-tag stack somewhere).  Clear the flag (or pop the
stack) if/when you see the matching closing tag.
3.  When you see that tag again, if it hasn't been closed in the
meantime, insert the closing tag yourself and clear your flag (pop
your stack).
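
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses one shared stack instead of per-tag flags, and it goes slightly beyond the three steps by letting a new <tr> close an open cell and letting </table> close whatever is still open. The names close_table_tags and IMPLIED_CLOSE are made up for this example, and the regex will not cope with attribute values that contain ">".

```ruby
# For each tracked tag, the open tags that a new occurrence implicitly closes.
IMPLIED_CLOSE = {
  'tr' => %w[td th tr],
  'td' => %w[td th],
  'th' => %w[td th],
}.freeze

# Insert the missing </td>, </th> and </tr> tags into an HTML string.
def close_table_tags(html)
  stack = [] # currently open table/tr/td/th tags, innermost last
  html.gsub(%r{<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>}) do
    tag      = Regexp.last_match(0)
    closing  = Regexp.last_match(1) == '/'
    name     = Regexp.last_match(2).downcase
    inserted = +''

    if !closing && name == 'table'
      stack.push(name) # acts as a barrier for nested tables
    elsif !closing && IMPLIED_CLOSE.key?(name)
      # Step 3: the tag reappears while still open, so close it first.
      inserted << "</#{stack.pop}>" while IMPLIED_CLOSE[name].include?(stack.last)
      stack.push(name) # step 2: remember that it is open
    elsif closing && (name == 'table' || IMPLIED_CLOSE.key?(name))
      idx = stack.rindex(name)
      if idx && idx >= (stack.rindex('table') || 0)
        # Close anything left open inside the element being closed.
        inserted << "</#{stack.pop}>" while stack.length > idx + 1
        stack.pop
      end
    end

    inserted + tag
  end
end

puts close_table_tags('<table><tr><td>foo<td>bar<tr><td>baz<td>boo</table>')
# prints <table><tr><td>foo</td><td>bar</td></tr><tr><td>baz</td><td>boo</td></tr></table>
```

Markup that already has its closing tags passes through unchanged, so the function can be applied to every page before it is handed to the parser.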

I think it will be easier to do than it sounds ;-)

Hope that helps,
Andy


Thomas, Mark - BLS CTR

2006-Nov-17 14:51 UTC


Re: Hpricot help - parsing malformed HTML

Andrew Stewart wrote:
> On 17 Nov 2006, at 00:06, Alex Wayne wrote:
>
> > The problem is that the HTML on Froogle is seriously broken.
>
> Agreed!

Disagree!

The example given is not malformed. It's perfectly acceptable HTML 4.01.
The end tags for <tr> and <td> can be omitted.

Unless the DTD declaration claims it to be something newer than HTML
4.01, it is fine.

I would say this is a bug in Hpricot.

- Mark.


Michael Campbell

2006-Nov-17 15:06 UTC


Re: Hpricot help - parsing malformed HTML

Take a look at scrapi - if not to actually use then to steal Assaf's
ideas.  =)  I THINK he has some sort of way to pre-process HTML with
Tidy in there; might want to crib those ideas.

On 11/16/06, Alex Wayne <rails-mailing-list-ARtvInVfO7ksV2N9l4h3zg@public.gmane.org> wrote:
>
> I am having a complicated issue here.  I am trying to fetch a page from
> Froogle and parse it via Hpricot to collect data from the products in
> the search results.
>
> sample page: http://froogle.google.com/froogle?q=magnets&btnG=Search
>
> The problem is that the HTML on Froogle is seriously broken.  I need to
> get the table row (tr) for each product, and then look in each of that
> row's td's for data.  But Google's HTML is full of unclosed tags for
> their tables, which makes Hpricot freak out.  Hpricot thinks the tr's are
> empty:
>
>   "<tr valign=\"top\">\n</tr>"
>
> So I guess the question is how do I make Hpricot cope with this markup?
> It obviously works great in the browser.  Are there any tools that will
> convert a string of html to a valid XML or DOM equivalent?  It must be
> possible because web browsers handle it all the time.
>
> What I need to be able to do:
>
>   html = open('http://foo.com/').read
>   html = html.clean_markup
>   html = Hpricot(html)
>
> ---
>
> Here is an oversimplified example of Froogle's malformed markup:
>
>   <table>
>   <tr>
>   <td>foo
>   <td>bar
>   <tr>
>   <td>baz
>   <td>boo
>   </table>
>

-- 

I think it is inevitable that people program poorly. Training will not
substantially help matters. We have to learn to live with it. -- Alan
Perlis


Rob Biedenharn

2006-Nov-17 15:06 UTC


Re: Hpricot help - parsing malformed HTML

On Nov 17, 2006, at 9:51 AM, Thomas, Mark - BLS CTR wrote:
> Andrew Stewart wrote:
>> On 17 Nov 2006, at 00:06, Alex Wayne wrote:
>>> The problem is that the HTML on Froogle is seriously broken.
>>
>> Agreed!
>
> Disagree!
>
> The example given is not malformed. It's perfectly acceptable HTML
> 4.01.
> The end tags for <tr> and <td> can be omitted.
>
> Unless the DTD declaration claims it to be something newer than HTML
> 4.01, it is fine.
>
> I would say this is a bug in Hpricot.
>
> - Mark.

You can use RubyfulSoup to deal with HTML even when it isn't
completely correct.  It is packaged as a gem, but I unpacked it into
the plugin directory and it's working for me.  (Hpricot didn't exist
at the time or I might have tried it.)

#Rubyful Soup
#Elixir and Tonic
#"The Screen-Scraper's Friend"
#v1.0.4
#http://www.crummy.com/software/RubyfulSoup/
#
#Rubyful Soup is a port to the Ruby language and idiom of the Python
#library Beautiful Soup.
#See http://www.crummy.com/software/BeautifulSoup/ for details on the original.

-Rob

Rob Biedenharn		http://agileconsultingllc.com
Rob-xa9cJyRlE0mWcWVYNo9pwxS2lgjeYSpx@public.gmane.org


Jeff Barczewski

2006-Nov-17 15:40 UTC


Re: Hpricot help - parsing malformed HTML

On 11/17/06, Michael Campbell <michael.campbell-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Take a look at scrapi - if not to actually use then to steal Assaf's
> ideas.  =)  I THINK he has some sort of way to pre-process HTML with
> Tidy in there; might want to crib those ideas.

We also use tidy for cleaning up invalid XHTML in the MasterView project.

You can get the ruby tidy wrapper here
http://rubyforge.org/projects/tidy
http://tidy.rubyforge.org/ (for usage info)

Note that it also requires the tidy library to be available on the
server. It is available for both Windows and *nix.

It works well at cleaning up invalid XHTML and the ruby tidy wrapper is
simple to use. The only disadvantage is that you need to have the lib
available and you need to set the path to the lib so that it can load it. I
wish that could be automated somehow, because it is a manual setup step.

Jeff


Alex Wayne

2006-Nov-17 17:21 UTC


Re: Hpricot help - parsing malformed HTML

Jeff Barczewski wrote:
> On 11/17/06, Michael Campbell <michael.campbell-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Take a look at scrapi - if not to actually use then to steal Assaf's
>> ideas.  =)  I THINK he has some sort of way to pre-process HTML with
>> Tidy in there; might want to crib those ideas.
>
> We also use tidy for cleaning up invalid XHTML in the MasterView project.
>
> You can get the ruby tidy wrapper here
> http://rubyforge.org/projects/tidy
> http://tidy.rubyforge.org/ (for usage info)
>
> Jeff

I seem to be having some luck with tidy, cleaning the markup before I
send it to Hpricot.

This little code snippet seems to handle keeping the Tidy.path assigned.
I just have to include the Linux and Windows tidy libs in my /lib
directory.

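  require 'tidy'
  # Point the tidy wrapper at the bundled native library for this platform.
  if RUBY_PLATFORM =~ /mswin/
    Tidy.path = "#{RAILS_ROOT}/lib/tidy.dll"   # Windows
  else
    Tidy.path = "#{RAILS_ROOT}/lib/tidy"       # Linux and other *nix
  end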
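
The same platform check could also be pulled into a small helper so it can be exercised without loading the tidy library. This is only a sketch: tidy_library_path is a hypothetical name, and matching "mingw" in addition to "mswin" is an assumption added here to cover Ruby builds on Windows whose RUBY_PLATFORM contains "mingw".

```ruby
# Hypothetical helper: pick the bundled Tidy library for a given platform.
# root is the application root (RAILS_ROOT in the snippet above).
def tidy_library_path(root, platform = RUBY_PLATFORM)
  library = platform.match?(/mswin|mingw/) ? 'tidy.dll' : 'tidy'
  File.join(root, 'lib', library)
end

# In the application this would be used as:
#   Tidy.path = tidy_library_path(RAILS_ROOT)
puts tidy_library_path('/srv/app', 'x86_64-linux')  # prints /srv/app/lib/tidy
puts tidy_library_path('/srv/app', 'i386-mswin32')  # prints /srv/app/lib/tidy.dll
```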

Thanks for the tip!

-- 
Posted via http://www.ruby-forum.com/.

