Jeffrey L. Taylor
2009-Apr-03 01:46 UTC
Recovering in Ruby-libxml parser from invalid UTF8 code
I am parsing XML streams with ruby-libxml using the XML::Reader class. Several have invalid UTF-8 characters. I need a tutorial or at least some hints on how to recover and continue the parsing.

TIA,
Jeffrey

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en
-~----------~----~----~----~------~----~------~--~---
Jeffrey L. Taylor wrote:
> I am parsing XML streams with ruby-libxml using the XML::Reader class.
> Several have invalid UTF-8 characters. I need a tutorial or at least some
> hints on how to recover and continue the parsing.

Why not scrub them with Ruby's built-in iconv first?

And what are they doing to you and ruby-libxml? I have found libxml2 suspiciously forgiving, so far...
Jeffrey L. Taylor
2009-Apr-03 02:25 UTC
Re: Recovering in Ruby-libxml parser from invalid UTF8 code
Quoting Phlip <phlip2005-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> Jeffrey L. Taylor wrote:
> > I am parsing XML streams with ruby-libxml using the XML::Reader class.
> > Several have invalid UTF-8 characters. I need a tutorial or at least some
> > hints on how to recover and continue the parsing.
>
> Why not scrub them with Ruby's built-in iconv first?
>
> And what are they doing to you and ruby-libxml? I have found libxml2
> suspiciously forgiving, so far...

It throws an exception. It took a bunch of digging to confirm that the byte at line 835, character 418 truly is not a valid UTF-8 character (octal 240, maybe a Latin-1 character?). I'd like to delete it or replace it with a question mark and continue parsing. It is a rather large file, so I'd rather not read the whole thing into memory to correct it.

I suppose I could wrap the read function in a clean-up function, but keeping UTF-8 state across partial reads is messy. I was hoping for something better.

Jeffrey
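
The "wrap the read function in a clean-up function" idea can be sketched roughly as below. This is only an illustration, not part of ruby-libxml: the class name Utf8Scrubber, the each_chunk method, and the default chunk size are all made up for the example, and it assumes Ruby 2.1 or later for String#scrub, which did not exist when this thread was written. The only state carried across partial reads is the tail of a chunk that might be the start of a multi-byte character (at most three bytes).

```ruby
# Hypothetical helper (not part of ruby-libxml): reads an IO in fixed-size
# chunks and replaces invalid UTF-8 bytes, so the whole file never has to
# be held in memory. Assumes Ruby >= 2.1 for String#scrub.
class Utf8Scrubber
  def initialize(io, chunk_size = 64 * 1024, replacement = "?")
    @io = io
    @chunk_size = chunk_size
    @replacement = replacement
    @carry = "".b # bytes of a possibly incomplete trailing character
  end

  # Yields cleaned, valid UTF-8 strings until the input is exhausted.
  def each_chunk
    while (data = @io.read(@chunk_size))
      buf = @carry + data.b
      keep = complete_prefix_length(buf)
      @carry = buf.byteslice(keep, buf.bytesize - keep)
      yield clean(buf.byteslice(0, keep))
    end
    # Whatever is still held back at EOF is a truncated character.
    yield clean(@carry) unless @carry.empty?
    @carry = "".b
  end

  private

  def clean(bytes)
    bytes.force_encoding(Encoding::UTF_8).scrub(@replacement)
  end

  # Length of the prefix of buf that does not end in the middle of a
  # multi-byte sequence. Only the last three bytes need inspecting.
  def complete_prefix_length(buf)
    n = buf.bytesize
    1.upto([3, n].min) do |back|
      byte = buf.getbyte(n - back)
      next if byte & 0xC0 == 0x80 # continuation byte, keep looking back
      need = if byte >= 0xF0 then 4
             elsif byte >= 0xE0 then 3
             elsif byte >= 0xC0 then 2
             else 1
             end
      # Hold the lead byte back if its sequence is not complete yet.
      return need > back ? n - back : n
    end
    n # only stray continuation bytes at the end; scrub will replace them
  end
end
```

A stray byte such as 0xA0 (octal 240) is never held back, because it is a continuation byte with no lead byte in front of it, so it is replaced in the chunk where it appears. One way to use the helper, again as an assumption rather than a recommendation from the thread, is to write the cleaned chunks to a temporary file and point the parser at that file.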