I'm writing a rails app that allows an admin to upload files that will be searched by users. These files may be in text or html format, and frequently are in different charsets. For uniform presentation, I'd like to convert everything to UTF8. However, I'm not sure how best to detect the format the uploaded docs are in. I noticed that the ruby iconv library needs to know what format you are converting from. Does anyone have any good ways to detect a document's charset?

While we're on the topic of conversion, I've also written a simple html_to_text conversion routine:

    def html_to_text(html)
      html.gsub!(/<\s*?script[^>]*?>.*?<\s*?\/script\s*?>/m, '')  # remove javascript
      html.gsub!(/<[\/\!]*?[^<>]*?>/m, '')                        # remove html tags
      html.gsub!(/([\r\n])[\s]+/m, '\1')                          # remove white space
      html.gsub!(/&(quot|\#34);/m, '"')                           # convert symbols
      html.gsub!(/&(amp|\#38);/m, '&')
      html.gsub!(/&(lt|\#60);/m, '<')
      html.gsub!(/&(gt|\#62);/m, '>')
      html.gsub!(/&(nbsp|\#160);/m, ' ')
      html.gsub!(/&(iexcl|\#161);/m, "\161")
      html.gsub!(/&(cent|\#162);/m, "\162")
      html.gsub!(/&(pound|\#163);/m, "\163")
      html.gsub!(/&(copy|\#169);/m, "\169")
      html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }
      html.strip!
      html
    end

Is there a better ruby library for this that tries to preserve structure more, or should I just stick with this approach?

Thanks,
Carl
Julian 'Julik' Tarkhanov
2005-Nov-19 18:29 UTC
Re: Converting uploaded HTML files into UTF8
On 19-nov-2005, at 18:38, Carl Youngblood wrote:

> I'm writing a rails app that allows an admin to upload files that will
> be searched by users. These files may be in text or html format, and
> frequently are in different charsets. For uniform presentation, I'd
> like to convert everything to UTF8. However, I'm not sure how best to
> detect the format the uploaded docs are in. I noticed that the ruby
> iconv library needs to know what format you are converting from. Does
> anyone have any good ways to detect a document's charset?

No, there is no such way. This is one of the reasons Unicode was designed: one of its special "perks" is that you can detect reasonably well whether a document is in Unicode (the probability of a random sequence of bytes matching Unicode is very low). The only thing you _can_ do is to say whether a document IS in Unicode (you have to know the form, though) or not. After that you need to fall back to the charset that you are most likely to expect on input.

What I would do if I were you:

1. In the upload form, add a checkbox that says "This file uses Unicode" - people who need to know DO know what it means
2. Turn the checkbox on by default
3. When the file is uploaded, convert it from ISO to Unicode if the checkbox was unchecked. If the HTML document contains a charset directive, decode it according to this directive.
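
The steps above can be sketched in Ruby roughly as follows. This is only an illustration, not code from the thread: it assumes Ruby 2.0 or later, where String#force_encoding, String#valid_encoding? and String#encode do the job that iconv did at the time. The method name to_utf8, the declared_unicode flag (standing in for the checkbox) and the ISO-8859-1 fallback are all hypothetical choices.

```ruby
# Sketch only: to_utf8, declared_unicode and the ISO-8859-1 default are
# hypothetical names/choices, not part of Rails or of the original thread.
def to_utf8(raw, declared_unicode: true, fallback: "ISO-8859-1")
  # Work on raw bytes so nothing is assumed about the upload yet.
  bytes = raw.dup.force_encoding(Encoding::BINARY)

  # A charset directive in the document (if any) is tried first.
  directive = bytes[/charset\s*=\s*["']?([\w.:-]+)/i, 1]

  # Then UTF-8 (only if the uploader left the box checked), then the fallback.
  candidates = [directive, ("UTF-8" if declared_unicode), fallback].compact

  candidates.each do |name|
    begin
      text = bytes.dup.force_encoding(name)
      # Skip this candidate if the bytes are not well-formed in it.
      next unless text.valid_encoding?
      return text.encode(Encoding::UTF_8)
    rescue ArgumentError, EncodingError
      # Unknown encoding name or unconvertible bytes: try the next candidate.
      next
    end
  end

  raise ArgumentError, "could not decode input with any candidate charset"
end
```

Trying UTF-8 before the fallback relies on the property described above: bytes that were written in a legacy single-byte charset very rarely form valid UTF-8, whereas almost any byte sequence is "valid" ISO-8859-1, so the fallback can never be verified, only assumed.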
You can also use the UTF-8 sanity regex - if the uploaded document matches it you can save it as UTF-8 without conversions.

> While we're on the topic of conversion, I've also written a simple
> html_to_text conversion routine:
>
> def html_to_text(html)
>   html.gsub!(/<\s*?script[^>]*?>.*?<\s*?\/script\s*?>/m, '')  # remove javascript
>   html.gsub!(/<[\/\!]*?[^<>]*?>/m, '')                        # remove html tags
>   html.gsub!(/([\r\n])[\s]+/m, '\1')                          # remove white space
>   html.gsub!(/&(quot|\#34);/m, '"')                           # convert symbols
>   html.gsub!(/&(amp|\#38);/m, '&')
>   html.gsub!(/&(lt|\#60);/m, '<')
>   html.gsub!(/&(gt|\#62);/m, '>')
>   html.gsub!(/&(nbsp|\#160);/m, ' ')
>   html.gsub!(/&(iexcl|\#161);/m, "\161")
>   html.gsub!(/&(cent|\#162);/m, "\162")
>   html.gsub!(/&(pound|\#163);/m, "\163")
>   html.gsub!(/&(copy|\#169);/m, "\169")
>   html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }
>   html.strip!
>   html
> end
>
> Is there a better ruby library for this that tries to preserve
> structure more, or should I just stick with this approach?

Look at the tag removal routines in the Rails source. But I think we currently don't have such a library - partly because it is unclear how you can handle HTML specifics (such as tables etc.)

> html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

So you say that you want your documents to be unicode and you do this? Strange.
Thanks for the tips. I can't rely on the uploaded documents having similar HTML format, so I convert them to plain text files first. As I understand it, even after converting to plaintext, it would still be possible to have characters that didn't get displayed properly because they used a different charset, so I want to convert to UTF8 after stripping out all HTML tags. Please let me know if this is an incorrect assumption.

On 11/19/05, Julian 'Julik' Tarkhanov <listbox-RY+snkucC20@public.gmane.org> wrote:

> > html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }
>
> So you say that you want your documents to be unicode and you do
> this? Strange.
Julian 'Julik' Tarkhanov
2005-Nov-19 20:05 UTC
Re: Converting uploaded HTML files into UTF8
On 19-nov-2005, at 20:18, Carl Youngblood wrote:

> Thanks for the tips. I can't rely on the uploaded documents having
> similar HTML format, so I convert them to plain text files first. As
> I understand it, even after converting to plaintext, it would still be
> possible to have characters that didn't get displayed properly because
> they used a different charset, so therefore I want to convert to UTF8
> after stripping out all HTML tags. Please let me know if this is
> an incorrect assumption.

Hmm. No, this assumption is not correct. First you have to get your text into the right encoding, and only then perform transforms on it.

>
> On 11/19/05, Julian 'Julik' Tarkhanov <listbox-RY+snkucC20@public.gmane.org> wrote:
>>> html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

What I meant when quoting this fragment is that this kind of conversion will get you only ASCII characters. You should use the UTF pack ("U") to get all the chars.
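
To illustrate the pack("U") suggestion, here is a minimal sketch of the numeric-entity step on its own. It assumes the text has already been converted to UTF-8 (encoding first, transforms second, as argued above). The helper name decode_numeric_entities is hypothetical, and the range check is an addition that is not in the original routine.

```ruby
# Sketch only: decode_numeric_entities is a hypothetical helper name.
# Expects text that is already UTF-8.
def decode_numeric_entities(text)
  text.gsub(/&#(\d+);/) do |entity|
    code = Regexp.last_match(1).to_i

    # Leave the entity untouched if it is not a valid Unicode scalar value
    # (beyond U+10FFFF, or in the surrogate range).
    next entity if code > 0x10FFFF || (0xD800..0xDFFF).cover?(code)

    # pack("U") emits the UTF-8 encoding of the code point;
    # pack("c") would keep only a single byte of it.
    [code].pack("U")
  end
end
```

With this, an entity such as &#8364; becomes the three-byte UTF-8 sequence for the euro sign instead of being cut down to one byte.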