nonrecursive-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
2007-Aug-27 14:31 UTC
how can I get malformed UTF-8 characters to display properly?
Hello everyone,

I'm scraping a lot of sites for a project, and occasionally the scraped content contains "malformed UTF-8" characters. When the scraped content is processed (basically, a database record is created), these characters often don't appear as they're supposed to.

Normally, the following code works great:

    str.unpack("U*").collect {|s| (s > 127 ? "&##{s};" : s.chr) }.join("")

But it won't work with these "malformed UTF-8" characters. So I've written the method below to handle them, but it still isn't perfect. For example, I scraped this page:

http://web.mac.com/j3mbeck/iWeb/JohnBeckPaper_Steel/Fireplace%20Surrounds.html

The alt attribute of the first thumbnail, steel surround, contains the text "Steel has that effect where you'd least expect it". The ' character shows up as Õ when I use the method below, and the "d" is just swallowed.

    data.gsub!(/\323/, '"')
    require 'oniguruma'
    o = Oniguruma::ORegexp.new('[^[:ascii:]]')
    # o = Oniguruma::ORegexp.new('[^[:ascii:]]', {:encoding => Oniguruma::ENCODING_UTF8})
    chars = []
    data.each_char{|c| chars << c}
    chars.collect do |c|
      if o.match c
        begin
          "&##{c.unpack('U*').first};"
        rescue ArgumentError
          add_log_message("Has malformed UTF-8 characters")
          # handling malformed UTF-8: a huge pain and possibly a future cause of problems
          bytes = []
          c.each_byte{|b| bytes << b}
          # assumes we're handling, at most, 2-byte strings. We have no way to know whether the
          # malformed character is supposed to be one byte or two, but we're assuming it's 1.
          ["&##{bytes[0]}"] + bytes[1..-1].collect{|b| b.chr}
        end
      else
        c
      end
    end.flatten.join('')

Any suggestions? Thanks!

Daniel
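For reference, here is a minimal sketch of one possible approach on later Ruby versions (1.9 and up, where strings carry an encoding). It rests on an assumption the post does not state: that the bytes failing UTF-8 validation are really MacRoman. That would fit the symptoms, since byte 0xD5 (decimal 213, which is Õ in Latin-1) is the right single quotation mark in MacRoman, and \323 (0xD3) is the right double quotation mark. The helper name to_numeric_entities and its fallback parameter are hypothetical, not part of any library.

```ruby
# Hypothetical helper (not from the original post): convert scraped bytes
# to ASCII text with numeric character references for non-ASCII characters.
def to_numeric_entities(raw, fallback: "macRoman")
  # First try to read the bytes as UTF-8.
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)

  unless utf8.valid_encoding?
    # Assumption: bytes that are not valid UTF-8 come from a legacy
    # single-byte encoding (MacRoman by default), so reinterpret the
    # whole string in that encoding and transcode it to UTF-8.
    utf8 = raw.dup.force_encoding(fallback)
              .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end

  # Emit a numeric character reference for every non-ASCII code point.
  utf8.each_char.map { |ch| ch.ord > 127 ? "&##{ch.ord};" : ch }.join
end

puts to_numeric_entities("you\xD5d least expect it".b)
```

Note that this reinterprets the whole string as soon as a single invalid byte is found, so a page that mixes valid UTF-8 with stray legacy bytes would still come out wrong; guessing the source encoding per page (for example, from its HTTP headers or meta tag) is likely to be more reliable than guessing per character.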