Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).
There's a pretty good HTML parser in RoR ActionPack, but it doesn't handle badly formed documents.
Thanks
--
Posted via http://www.ruby-forum.com/.
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk
-~----------~----~----~----~------~----~------~--~---
Hey nuno,
Urm, okay, call me stupid Ishmael, but why not simply subclass the current HTML parser and then, whenever you get a 'bad tag', do whatever you want to do with it?
I dare say that if someone passes me a badly formed document, I -want- them to see an error; however, whatever -you- decide to do with it is up to (well) -you-. If you want to try and 'fix' certain errors in a bad document, that's surely down to 'you'.
You may get lucky and someone may have already trod this path, but surely in the case of 'bad data' you're not best placed to say what's 'valid' and what's not; surely that's something only the originating user can do.
That is to say, you can deal with things like a missing '>' fairly simply, but what about character transposition (inptu instead of input) or character addition (<input name="freds"> instead of <input name="fred">)?
I think the -sanest- thing a parser can do is raise an error on badly formed input.
Perhaps not the answer you want, and I look forward to being proved 'wrong', but, well, *polite shrug* there's my 2c ;p
Regards
Stef
nuno wrote:
> Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).
> There's a pretty good HTML parser in RoR ActionPack, but it doesn't handle badly formed documents.
> Thanks
nuno wrote:
> Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).
> There's a pretty good HTML parser in RoR ActionPack, but it doesn't handle badly formed documents.
> Thanks
Try scrapi: http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/
Also, you can use HTMLTidy to clean it up.
Personally, I use rubyful_soup, but that's because I had already implemented it before finding out about scrapi.
Regards,
Michael
"I dare say that if someone passes me a badly formed document, I -want- them to see an error, however whatever -you- decide to do with it is upto (well) -you-. If you want to try and ''fix'' certain errors in a bad document, thats surely down to ''you''" ****Usually when you are scraping you don''t have control over the content so you have to take what is given to you and do the best you can do with it. I believe HTMLTidy will clean up malformed documents. Regards, Michael -- Posted via http://www.ruby-forum.com/. --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group. To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org To unsubscribe from this group, send email to rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org For more options, visit this group at http://groups.google.com/group/rubyonrails-talk -~----------~----~----~----~------~----~------~--~---
Hi,
nuno <rails-mailing-list-ARtvInVfO7ksV2N9l4h3zg@public.gmane.org> writes:
> Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).
Did you try this one? http://mechanize.rubyforge.org/
--
\ / http://www.hashbang.de
 \/lad http://www.1-cat.de
nuno wrote:
> Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).
HTML Tidy might be what you're looking for.
http://www.google.com/search?hl=en&sa=X&oi=spell&resnum=0&ct=result&cd=1&q=html+tidy&spell=1
hth,
Bill
Hello Michael,
Whereas I agree with you in regards to the whole 'you can't control someone else's webpage when they don't conform to the standard', I do think that if you're scraping a webpage, you don't really want to fling it into an HTMLParser anyway. Surely it's much quicker to treat the HTML as a 'string' and then regex out what you need?
Of course, this is probably either my Perl background, rampant pragmatism or bad programming showing .. but .. whenever I have wanted to check the 'well-formedness' of a document, it's almost always been 'uploaded' to the system I am using. So, that's where I base my whole 'fling an error on error' practice from ;)
So, in essence, I guess it depends what the user is using the HTMLParser 'for' :)
Regards
Stef
Michael Modica wrote:
> "I dare say that if someone passes me a badly formed document, I -want- them to see an error, however whatever -you- decide to do with it is up to (well) -you-. If you want to try and 'fix' certain errors in a bad document, that's surely down to 'you'"
> Usually when you are scraping you don't have control over the content, so you have to take what is given to you and do the best you can with it. I believe HTMLTidy will clean up malformed documents.
> Regards,
> Michael
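For anyone following along, here is a minimal sketch of the 'treat the HTML as a string and regex out what you need' idea from the message above. The sample markup and the helper name extract_links are made up for illustration, and the pattern assumes double-quoted href values and link text with no nested tags; anything messier would probably need a real parser.

```ruby
# Hypothetical helper: returns [href, text] pairs for simple anchors.
# Assumes double-quoted href attributes and plain-text link content.
def extract_links(html)
  html.scan(/<a\s[^>]*href="([^"]*)"[^>]*>([^<]*)<\/a>/i)
end

# Made-up fragment with unclosed <li> and <p> tags, as a scraper might receive it.
html = '<ul><li><a href="/one">One</a><li><a href="/two">Two</a></ul><p>tail'

extract_links(html).each { |href, text| puts "#{text} -> #{href}" }
# prints:
# One -> /one
# Two -> /two
```

The unclosed tags never come into play because the pattern only looks at the anchors, which is both the appeal and the limitation of this approach.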
nuno <rails-mailing-list@...> writes:
> Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).
> There's a pretty good HTML parser in RoR ActionPack, but it doesn't handle badly formed documents.
> Thanks
Just a technical point: unclosed tags are _not_ badly formed in HTML; they are exactly the _right_ way to do things in HTML. HTML is not supposed to be an XML-based language, and self-closing tags are invalid.
That said, I agree with the person who said it's better to just treat it as one long string and regex it.
Thanks for your answers! scrapi seems to be all I need ...
On Sep 1, 2006, at 12:39 PM, Gareth Adams wrote:
> nuno <rails-mailing-list@...> writes:
>> Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).
> Just a technical point: unclosed tags are _not_ badly formed in HTML; they are exactly the _right_ way to do things in HTML. HTML is not supposed to be an XML-based language, and self-closing tags are invalid.
Consider:
<ul>
<li>a
<li>b
<li>c
<li>d
<ul>
<li>e
<li>f
<li>g
<li>h
(BTW, the OP said 'unclosed tags', not 'self-closing tags' (by which I think you mean empty tags).)
More importantly, this illustrates an ambiguity that makes dealing with ill-formed HTML difficult, even with a regex. What was meant: a nested list or two separate lists? Indentation suggests one thing, but a peek in a browser another. But surely the author looked at the page in the browser and saw that it was okay. Right, surely. But with a little CSS, who knows what was seen.
Tools like Tidy will turn that example into:
<ul>
  <li>a</li>
  <li>b</li>
  <li>c</li>
  <li>d
    <ul>
      <li>e</li>
      <li>f</li>
      <li>g</li>
      <li>h</li>
    </ul>
  </li>
</ul>
which is probably how a browser would interpret it. Some of the other tools will do something similar when parsing it.
Cheers,
Bob
----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
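To make the nested-list reading concrete, here is a rough sketch of how a tool could arrive at it: close a still-open <li> whenever a sibling <li> starts, and close whatever remains at end of input. This is not Tidy's actual algorithm; close_list_tags is a hypothetical helper that only knows about <ul> and <li>, written in plain Ruby with no libraries.

```ruby
# Hypothetical helper (not how Tidy is implemented): inserts the end tags
# that are implied for <ul>/<li> markup. Other tags pass through untouched.
def close_list_tags(html)
  stack = [] # names of the elements currently open, innermost last
  out = []
  # Walk the input as a sequence of tags and text runs.
  html.scan(/<[^>]+>|[^<]+/) do |token|
    case token
    when /\A<li\b/i
      # A new <li> implicitly ends a still-open sibling <li>.
      out << "</#{stack.pop}>" if stack.last == "li"
      stack << "li"
      out << token
    when /\A<ul\b/i
      stack << "ul"
      out << token
    when /\A<\/(ul|li)>/i
      name = $1.downcase
      # An explicit end tag also closes anything opened inside that element;
      # an end tag with no matching open element is dropped.
      if stack.include?(name)
        out << "</#{stack.pop}>" until stack.last == name
        out << "</#{stack.pop}>"
      end
    else
      out << token
    end
  end
  # End of input closes whatever is still open, innermost first.
  out << "</#{stack.pop}>" until stack.empty?
  out.join
end

puts close_list_tags("<ul><li>a<li>b<ul><li>c<li>d")
# prints: <ul><li>a</li><li>b<ul><li>c</li><li>d</li></ul></li></ul>
```

Because the second <ul> arrives while an <li> is still open, it ends up nested inside that item; getting two separate lists would need an explicit </ul> in the source, which is exactly the information the original author left out.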
Why the Lucky Stiff has a great parser, hpricot:
http://code.whytheluckystiff.net/hpricot/
If you need to follow links or fill out forms as well, the trunk of mechanize can use hpricot as its parser. Deadly combo!
joshua
On 9/1/06, nuno <rails-mailing-list-ARtvInVfO7ksV2N9l4h3zg@public.gmane.org> wrote:
> Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).
> There's a pretty good HTML parser in RoR ActionPack, but it doesn't handle badly formed documents.
> Thanks
Google for Rubyful Soup - it's a port (by the original author) of the excellent Python parser "Beautiful Soup", which is explicitly designed to deal with messy, badly-formed, awkward HTML - i.e., the real-world examples of it.
On 04/09/06, Joshua Bates <joshuabates-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Why the Lucky Stiff has a great parser, hpricot:
> http://code.whytheluckystiff.net/hpricot/
> If you need to follow links or fill out forms as well, the trunk of mechanize can use hpricot as its parser. Deadly combo!
> joshua
> On 9/1/06, nuno <rails-mailing-list-ARtvInVfO7ksV2N9l4h3zg@public.gmane.org> wrote:
>> Hello, I'm looking for an HTML parser that can handle badly formed input (unclosed tags).
>> There's a pretty good HTML parser in RoR ActionPack, but it doesn't handle badly formed documents.
>> Thanks