Hello,

I'm looking for an HTML parser that can handle badly formed input
(unclosed tags).

There's a pretty good HTML parser in RoR ActionPack, but it doesn't
handle badly formed documents.

Thanks

--
Posted via http://www.ruby-forum.com/.
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"Ruby on Rails: Talk" group.
To post to this group, send email to
rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to
rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at
http://groups.google.com/group/rubyonrails-talk
-~----------~----~----~----~------~----~------~--~---

Hey nuno,
    Urm, okay, call me stupid Ishmael, but why not merely subclass the
current HTML parser and then, whenever you get a 'bad tag', do whatever
you want to do with it? I dare say that if someone passes me a badly
formed document, I -want- them to see an error; however, whatever -you-
decide to do with it is up to (well) -you-. If you want to try and 'fix'
certain errors in a bad document, that's surely down to you.
    You may get lucky and someone may have already trod this path, but
surely in the case of 'bad data' you're not best placed to say what's
'valid' and what's not. Surely that's something only the originating
user can do. I mean to say, you can deal with things like a missing '>'
fairly simply, but what about character transposition (inptu instead of
input) or character addition (<input name="freds"> instead of
<input name="fred">)?
    I think the -sanest- thing a parser can do is raise an error on
badly formed input. Perhaps not the answer you want, and I look forward
to being proved 'wrong', but, well, *polite shrug*, there's my 2c ;p
    Regards
    Stef
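
For illustration, a minimal sketch of the strict 'raise an error' approach described above. The helper name check_strict! is hypothetical and is not part of ActionPack or of any library mentioned in this thread. It only tracks start and end tags with a regex and a stack, so scripts and attribute values containing '>' are not handled, and it assumes the HTML 4 list of elements that never take an end tag.

```ruby
# Hypothetical helper for illustration; not an ActionPack API.
# Returns true when every start tag has a matching end tag,
# otherwise raises ArgumentError.
def check_strict!(html)
  # Elements that never take an end tag in HTML 4.
  void = %w[area base br col hr img input link meta param]
  open_tags = []

  # Each match yields an optional "/" (end tag) and the tag name.
  html.scan(%r{<(/)?([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |slash, name|
    name = name.downcase
    next if void.include?(name)

    if slash
      expected = open_tags.pop
      unless expected == name
        raise ArgumentError,
              "found </#{name}> while <#{expected || 'nothing'}> was open"
      end
    else
      open_tags.push(name)
    end
  end

  unless open_tags.empty?
    raise ArgumentError, "unclosed tags: #{open_tags.join(', ')}"
  end
  true
end

begin
  check_strict!("<ul><li>a<li>b</ul>")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

As the message points out, a check like this can spot a missing end tag but has no way of knowing that inptu was meant to be input.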

Try scrapi:
http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/

Also, you can use HTMLTidy to clean it up. Personally, I use
rubyful_soup, but that's only because I had already implemented it
before finding out about scrapi.

Regards,
Michael

> I dare say that if someone passes me a badly formed document, I -want-
> them to see an error; however, whatever -you- decide to do with it is
> up to (well) -you-. If you want to try and 'fix' certain errors in a
> bad document, that's surely down to you.

Usually when you are scraping you don't have control over the content,
so you have to take what is given to you and do the best you can with
it. I believe HTMLTidy will clean up malformed documents.

Regards,
Michael

Hi,

did you try this one?
http://mechanize.rubyforge.org/

--
\  /   http://www.hashbang.de
 \/lad http://www.1-cat.de

HTML Tidy might be what you're looking for.
http://www.google.com/search?hl=en&sa=X&oi=spell&resnum=0&ct=result&cd=1&q=html+tidy&spell=1

hth,
Bill

Hello Michael,
    While I agree with you on the whole 'you can't control someone
else's webpage when they don't conform to the standard' point, I do
think that if you're scraping a webpage, you don't really want to fling
it into an HTMLParser anyway. Surely it's much quicker to treat the HTML
as a 'string' and then regex out what you need?
    Of course, this is probably either my Perl background, rampant
pragmatism or bad programming showing... but whenever I have wanted to
check the 'well-formedness' of a document, it has almost always been
'uploaded' to the system I am using. So that's where my whole 'fling an
error on error' practice comes from ;) In essence, I guess it depends
what the user is using the HTMLParser 'for' :)
    Regards
    Stef
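
A minimal sketch of the 'treat it as a string and regex out what you need' idea, using only String#scan from the Ruby standard library. The method name extract_links and the sample page are made up for illustration. It assumes quoted href values, so unquoted attributes are missed, which is the sort of edge case a real parser handles for you.

```ruby
# Hypothetical helper: pull href values out of raw HTML without parsing it.
# Works on unclosed tags because it never looks at document structure.
def extract_links(html)
  # One capture group, so scan returns [[href], [href], ...]; flatten it.
  html.scan(/<a\b[^>]*\bhref\s*=\s*["']([^"']*)["']/i).flatten
end

# Sample input with unclosed <p> and <a> elements.
page = '<p><a href="/one">one<p><a class="x" href="/two">two'
puts extract_links(page)
```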

Just a technical point: unclosed tags are _not_ badly formed in HTML;
they are exactly the _right_ way to do things in HTML. HTML is not
supposed to be an XML-based language, and self-closing tags are invalid.

That said, I agree with the person who said it's better to just treat it
as one long string and regex it.

Thanks for your answers! scrapi seems to be all I need...

On Sep 1, 2006, at 12:39 PM, Gareth Adams wrote:
> Just a technical point: Unclosed tags are _not_ badly formed in HTML,
> they are exactly the _right_ way to do things in HTML. HTML is not
> supposed to be an XML based language, and self-closing tags is invalid.

Consider:

    <ul>
      <li>a
      <li>b
      <li>c
      <li>d
    <ul>
      <li>e
      <li>f
      <li>g
      <li>h

(BTW, the OP said 'unclosed tags', not 'self-closing tags' (by which I
think you mean empty tags).)

More importantly, this illustrates an ambiguity that makes dealing with
ill-formed HTML difficult, even with a regex. What was meant: a nested
list or two separate lists? Indentation suggests one thing, but a peek
in a browser another. But surely the author looked at the page in the
browser and saw that it was okay. Right, surely. But with a little CSS,
who knows what was seen.

Tools like Tidy will turn that example into:

    <ul>
      <li>a</li>
      <li>b</li>
      <li>c</li>
      <li>d
        <ul>
          <li>e</li>
          <li>f</li>
          <li>g</li>
          <li>h</li>
        </ul>
      </li>
    </ul>

which is probably how a browser would interpret it. Some of the other
tools will do something similar when parsing it.

Cheers,
Bob

----
Bob Hutchison          -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc.  -- <http://www.recursive.ca/>
Raconteur              -- <http://www.raconteur.info/>
xampl for Ruby         -- <http://rubyforge.org/projects/xampl/>

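
A rough sketch of the repair shown in the message above, where a new <li> ends the previous one and the end of input closes whatever is still open. This is not how Tidy is implemented; close_list_items is a hypothetical helper that only knows about <ul>, <ol> and <li> and copies everything else through unchanged.

```ruby
# Hypothetical helper: add the end tags that the list markup leaves out.
def close_list_items(html)
  stack = []          # names of the list elements currently open
  out = String.new

  # Walk the input as a sequence of tags and the text between them.
  html.scan(/<[^>]+>|[^<]+/) do |token|
    if token =~ /\A<li\b/i
      # A new <li> implicitly ends the previous one at the same level.
      out << "</#{stack.pop}>" if stack.last == "li"
      stack.push("li")
      out << token
    elsif (m = token.match(/\A<(ul|ol)\b/i))
      stack.push(m[1].downcase)
      out << token
    elsif (m = token.match(%r{\A</(ul|ol|li)\s*>}i))
      name = m[1].downcase
      next unless stack.include?(name)   # drop a stray end tag
      # Close anything still open inside the element being ended.
      out << "</#{stack.pop}>" while stack.last != name
      stack.pop
      out << "</#{name}>"
    else
      out << token
    end
  end

  # End of input closes whatever is left, innermost first.
  out << "</#{stack.pop}>" until stack.empty?
  out
end

puts close_list_items("<ul><li>a<li>b<ul><li>c")
```

Note that the sketch settles the ambiguity the same way the Tidy output does: the second <ul> lands inside the last open <li>, whatever the author's indentation suggested.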
Why the Lucky Stiff has a great parser, hpricot:
http://code.whytheluckystiff.net/hpricot/

If you need to follow links or fill out forms as well, the trunk of
mechanize can use hpricot as its parser. Deadly combo!

joshua

On 9/1/06, nuno wrote:
> Hello, I'm looking for an HTML parser that can handle bad formed input
> (unclosed tags).

Google for Rubyful Soup. It's a port (by the original author) of the
excellent Python parser "Beautiful Soup", which is explicitly designed
to deal with messy, badly formed, awkward HTML, i.e. the real-world
examples of it.

On 04/09/06, Joshua Bates wrote:
> Why the Lucky Stiff has a great parser, hpricot
> http://code.whytheluckystiff.net/hpricot/
>
> If you need to follow links or fill out forms as well, the trunk
> of mechanize can use hpricot as its parser. Deadly combo!