thr3ads.net - Rails - Converting uploaded HTML files into UTF8 [Nov 2005]

Carl Youngblood

2005-Nov-19 17:38 UTC

Converting uploaded HTML files into UTF8

I'm writing a rails app that allows an admin to upload files that will
be searched by users.  These files may be in text or html format, and
frequently are in different charsets.  For uniform presentation, I'd
like to convert everything to UTF8.  However, I'm not sure how best to
detect the format the uploaded docs are in.  I noticed that the ruby
iconv library needs to know what format you are converting from.  Does
anyone have any good ways to detect a document's charset?

While we're on the topic of conversion, I've also written a simple
html_to_text conversion routine:

  def html_to_text(html)
    html.gsub!(/<\s*?script[^>]*?>.*?<\s*?\/script\s*?>/m, '')  # remove javascript
    html.gsub!(/<[\/\!]*?[^<>]*?>/m, '')                        # remove html tags
    html.gsub!(/([\r\n])[\s]+/m, '\1')                          # remove white space
    html.gsub!(/&(quot|\#34);/m, '"')                           # convert symbols
    html.gsub!(/&(amp|\#38);/m, '&')
    html.gsub!(/&(lt|\#60);/m, '<')
    html.gsub!(/&(gt|\#62);/m, '>')
    html.gsub!(/&(nbsp|\#160);/m, ' ')
    html.gsub!(/&(iexcl|\#161);/m, "\161")
    html.gsub!(/&(cent|\#162);/m, "\162")
    html.gsub!(/&(pound|\#163);/m, "\163")
    html.gsub!(/&(copy|\#169);/m, "\169")
    html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }
    html.strip!
    html
  end

Is there a better ruby library for this that tries to preserve
structure more, or should I just stick with this approach?

Thanks,
Carl

Julian 'Julik' Tarkhanov

2005-Nov-19 18:29 UTC

Re: Converting uploaded HTML files into UTF8

On 19-nov-2005, at 18:38, Carl Youngblood wrote:
> I'm writing a rails app that allows an admin to upload files that will
> be searched by users.  These files may be in text or html format, and
> frequently are in different charsets.  For uniform presentation, I'd
> like to convert everything to UTF8.  However, I'm not sure how best to
> detect the format the uploaded docs are in.  I noticed that the ruby
> iconv library needs to know what format you are converting from.  Does
> anyone have any good ways to detect a document's charset?

No, there is no such way. This is one of the reasons Unicode was
designed: one of its special "perks" is that you can detect reasonably
well whether a document is in Unicode (the probability of a random
sequence of bytes matching Unicode is very low).

The only thing you _can_ do is to say whether a document IS in
Unicode (you have to know the form though) or not.
After that you need to fall back to the charset that you are most likely
to expect on input.

What I would do if I were you:

1. In the upload form, add a checkbox that says "This file uses
Unicode". People who need to know DO know what it means.
2. Turn the checkbox on by default.
3. When the file is uploaded, convert it from ISO to Unicode if the
checkbox was unchecked. If the HTML document contains a charset
directive, decode it according to this directive. You can also use the
UTF-8 sanity regex: if the uploaded document matches it, you can save
it as UTF-8 without conversion.
> While we're on the topic of conversion, I've also written a simple
> html_to_text conversion routine:
>
>   def html_to_text(html)
>     html.gsub!(/<\s*?script[^>]*?>.*?<\s*?\/script\s*?>/m, '')  # remove javascript
>     html.gsub!(/<[\/\!]*?[^<>]*?>/m, '')                        # remove html tags
>     html.gsub!(/([\r\n])[\s]+/m, '\1')                          # remove white space
>     html.gsub!(/&(quot|\#34);/m, '"')                           # convert symbols
>     html.gsub!(/&(amp|\#38);/m, '&')
>     html.gsub!(/&(lt|\#60);/m, '<')
>     html.gsub!(/&(gt|\#62);/m, '>')
>     html.gsub!(/&(nbsp|\#160);/m, ' ')
>     html.gsub!(/&(iexcl|\#161);/m, "\161")
>     html.gsub!(/&(cent|\#162);/m, "\162")
>     html.gsub!(/&(pound|\#163);/m, "\163")
>     html.gsub!(/&(copy|\#169);/m, "\169")
>     html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }
>     html.strip!
>     html
>   end
>
> Is there a better ruby library for this that tries to preserve
> structure more, or should I just stick with this approach?

Look at the tag removal routines in the Rails source. But I think we
currently don't have such a library, partly because it is unclear
how you can handle HTML specifics (such as tables etc.)

>     html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

So you say that you want your documents to be Unicode and you do
this? Strange.

Carl Youngblood

2005-Nov-19 19:18 UTC

Re: Converting uploaded HTML files into UTF8

Thanks for the tips.  I can't rely on the uploaded documents having
similar HTML format, so I convert them to plain text files first.  As
I understand it, even after converting to plaintext, it would still be
possible to have characters that didn't get displayed properly because
they used a different charset, so therefore I want to convert to UTF8
after stripping out all HTML tags.  Please let me know if this is
an incorrect assumption.

On 11/19/05, Julian 'Julik' Tarkhanov
<listbox-RY+snkucC20@public.gmane.org> wrote:
> >     html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }
>
> So you say that you want your documents to be unicode and you do
> this? Strange.

Julian 'Julik' Tarkhanov

2005-Nov-19 20:05 UTC

Re: Converting uploaded HTML files into UTF8

On 19-nov-2005, at 20:18, Carl Youngblood wrote:
> Thanks for the tips.  I can't rely on the uploaded documents having
> similar HTML format, so I convert them to plain text files first.  As
> I understand it, even after converting to plaintext, it would still be
> possible to have characters that didn't get displayed properly because
> they used a different charset, so therefore I want to convert to UTF8
> after stripping out all HTML tags.  Please let me know me if this is
> an incorrect assumption.

Hmm. No, this assumption is not correct. First you have to get your
text into the right encoding, and then perform transforms on it.
>
> On 11/19/05, Julian 'Julik' Tarkhanov
> <listbox-RY+snkucC20@public.gmane.org> wrote:
>>>     html.gsub!(/&\#(\d+);/m) {|s| [$1.to_i].pack('c') }

What I meant when quoting this fragment is that this kind of
conversion will only get you ASCII characters. You should use the UTF
pack directive ("U") to get all the characters.
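
For illustration, here is a sketch of how the routine from the first post might look with the "U" pack directive. It is not code posted in this thread, and it makes some assumptions of its own: the input is already UTF-8 (encoding first, transforms second, as advised above), the `NAMED_ENTITIES` table is a made-up name that covers only the entities from the original routine, and the string is copied rather than modified in place. It also replaces the `"\161"`-style literals, which Ruby reads as octal escapes rather than as character 161.

```ruby
# Named entities from the original routine, mapped to Unicode code points.
# Hypothetical table; a full list of HTML entities is much longer.
NAMED_ENTITIES = {
  'quot' => 34, 'amp' => 38, 'lt' => 60, 'gt' => 62, 'nbsp' => 160,
  'iexcl' => 161, 'cent' => 162, 'pound' => 163, 'copy' => 169
}.freeze

def html_to_text(html)
  text = html.dup                                            # leave the caller's string alone
  text.gsub!(/<\s*script[^>]*>.*?<\s*\/script\s*>/mi, '')    # remove javascript
  text.gsub!(/<[^<>]*>/m, '')                                # remove html tags
  text.gsub!(/([\r\n])\s+/, '\1')                            # remove white space
  # Decode named and numeric entities in one pass, so that "&amp;lt;"
  # becomes "&lt;" and is not decoded a second time.
  text.gsub!(/&(?:#(\d+)|([A-Za-z0-9]+));/) do |entity|
    code = $1 ? $1.to_i : NAMED_ENTITIES[$2]
    code = 32 if code == 160                                 # nbsp becomes a plain space, as in the original
    if code && code <= 0x10FFFF
      [code].pack('U')                                       # "U" emits the UTF-8 bytes for the code point
    else
      entity                                                 # unknown entity: keep it as written
    end
  end
  text.strip
end
```

With `pack('c')` a reference such as `&#8364;` is cut down to a single byte; with `pack('U')` it becomes the euro sign encoded as UTF-8.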
