nonrecursive-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
2007-Aug-27 14:31 UTC
how can I get malformed UTF-8 characters to display properly?
Hello everyone,
I'm scraping a lot of sites for a project, and occasionally the
scraped content will have "malformed UTF-8" characters. When the
scraped content is processed (basically a database record is created),
these characters often don't appear as they're supposed to.
Normally, the following code works great:
str.unpack("U*").collect {|s| (s > 127 ? "&##{s};" : s.chr) }.join("")
But it won't work with these "malformed UTF-8" characters. So I've
written the following to handle these characters, but it still isn't
perfect. For example, I scraped this page:
http://web.mac.com/j3mbeck/iWeb/JohnBeckPaper_Steel/Fireplace%20Surrounds.html
The alt attribute of the first thumbnail, steel surround, contains the
text "Steel has that effect where you'd least expect it". The '
character shows up as Õ when I use the method below, and the "d" is
just swallowed.
data.gsub!(/\323/, '"')
require 'oniguruma'
o = Oniguruma::ORegexp.new('[^[:ascii:]]')
# o = Oniguruma::ORegexp.new('[^[:ascii:]]', {:encoding => Oniguruma::ENCODING_UTF8})
chars = []
data.each_char{|c|chars << c}
chars.collect do |c|
  if o.match c
    begin
      "&##{c.unpack('U*').first};"
    rescue ArgumentError
      add_log_message("Has malformed UTF-8 characters")
      # Handling malformed UTF-8: a huge pain and a possible future cause of problems.
      bytes = []
      c.each_byte{|b| bytes << b}
      # Assumes we're handling, at most, 2-byte strings. We have no way of knowing
      # whether the malformed character is supposed to be one byte or two, but
      # we're assuming it's 1.
      ["&##{bytes[0]}"] + bytes[1..-1].collect{|b|b.chr}
    end
  else
    c
  end
end.flatten.join('')
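For readers on Ruby 2.1 or later, where strings carry an encoding and String#scrub is available, the same idea can be sketched without Oniguruma. This is only an illustration, not the code from the post: to_numeric_entities and its source_encoding parameter are hypothetical names. Passing "macRoman" assumes the scraped page was really saved as Mac OS Roman. The byte values above hint at that (octal 323 is a closing double quote and byte 213 a closing single quote in that encoding), but the post does not confirm it.

```ruby
# Hypothetical helper (not from the original post): turn every non-ASCII
# character into a numeric HTML entity, coping with bytes that are not
# valid UTF-8.
def to_numeric_entities(data, source_encoding = nil)
  text = data.dup.force_encoding(Encoding::UTF_8)

  unless text.valid_encoding?
    if source_encoding
      # Assumption: the caller knows the real encoding of the page,
      # so the raw bytes can be transcoded to UTF-8 as a whole.
      text = data.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
    else
      # No hint available: replace each invalid byte run with one
      # entity per byte and leave the valid characters around it alone.
      text = text.scrub { |bad| bad.bytes.map { |b| "&##{b};" }.join }
    end
  end

  # Everything is valid UTF-8 now; encode the remaining non-ASCII characters.
  text.gsub(/[^[:ascii:]]/) { |ch| "&##{ch.ord};" }
end

raw = "you\xD5d least expect it"          # byte 213 is not valid UTF-8 here
puts to_numeric_entities(raw)             # byte-wise fallback
puts to_numeric_entities(raw, "macRoman") # assumes a Mac OS Roman source
```

With no encoding hint the stray byte becomes &#213;, which a browser renders as Õ, so the fallback alone reproduces the symptom described above. With the hint it becomes &#8217;, the curly apostrophe, and the following "d" is kept in both cases.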
Any suggestions?
Thanks!
Daniel
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"Ruby on Rails: Talk" group.
-~----------~----~----~----~------~----~------~--~---
