thr3ads.net - Rails core - Overview of Ruby 1.9 encoding problem tickets [Apr 2010]

Czarek

2010-Apr-19 13:58 UTC

Overview of Ruby 1.9 encoding problem tickets

SUMMARY:
--------

I tried to identify the general and root causes for these problems
with 1.9, by taking into account non-utf encoding, current patches,
comments and ideas. I used ticket #2188 as base for explanations.

This is a long read. I wanted to include all the relevant information
in one place. I also included information about related tickets in LH
and their status. I decided that adding parts of this to LH would just
add to the confusion.

Two patches are included (one is from Andrew Grim) that should fix one
issue (#2188) in a way that fixes the problem and doesn't break
anything. Two small steps for Rails, one giant step for proper
encoding support. I hope.

I welcome any feedback that would help get Rails closer to fully
supporting Ruby 1.9 and vice-versa.

SOLUTION:
---------

The general idea is: allow only one "internal" encoding in Rails at
any given time, based on the default Ruby encoding (or configurable).

Treat any incoming external strings that cannot be converted to
this "internal" encoding as errors in the gems in which they occur, and
possibly report mismatches before they even "enter" Rails, by
attempting to convert them into the "internal" encoding immediately.
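
A minimal sketch of such a boundary check, assuming a hypothetical
helper to_internal and a hypothetical ForeignEncodingError (neither
exists in Rails; both names are made up for illustration):

```ruby
# Hypothetical error raised when a string cannot enter the "internal" encoding.
class ForeignEncodingError < StandardError; end

# Hypothetical helper: convert str to the internal encoding or fail loudly,
# naming the source (gem, adapter, ...) that produced the string.
def to_internal(str, internal: Encoding.default_internal || Encoding.default_external,
                source: "unknown")
  unless str.valid_encoding?
    raise ForeignEncodingError, "#{source}: invalid #{str.encoding} byte sequence"
  end
  return str if str.encoding == internal

  str.encode(internal)
rescue Encoding::UndefinedConversionError,
       Encoding::InvalidByteSequenceError,
       Encoding::ConverterNotFoundError => e
  raise ForeignEncodingError,
        "#{source}: cannot convert #{str.encoding} to #{internal} (#{e.class})"
end

# An EUC-JP string from a database adapter converts cleanly to UTF-8.
from_db = "日本語".encode("EUC-JP")
to_internal(from_db, internal: Encoding::UTF_8, source: "db adapter")
```

A BINARY string with non-ASCII bytes has no defined conversion, so it
is reported at the boundary instead of failing later in a concat.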

As a result of enforcing this, all Rails tests should work with any
encoding that is a superset of the encodings used for input (db,
Rack, ERB, Haml, ...) in a given environment.

With an optimal setup (db encoding, Ruby encoding, Rack encoding
settings, I18n translations, ...), no transcoding will occur during
the rendering process, no matter which default Rails encoding is
used (including ASCII_8BIT), and no force_encoding would be needed
internally in Rails, except as workarounds for gems and libraries
where this is difficult otherwise.

The guideline for gem and plugin developers would be: do not create or
return strings (other than for internal use) that are not compatible
with the default encoding both ways.

In some cases, it may be acceptable to drop or escape characters that
cannot be transcoded (maybe Rack input, for example).
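
For the lenient case, something along these lines might do. It is only
a sketch: to_internal_lossy and to_internal_escaped are hypothetical
names, and both assume ASCII-compatible source encodings.

```ruby
# Hypothetical helper: drop (replace) whatever is invalid in the source
# encoding or has no counterpart in the target encoding.
def to_internal_lossy(str, internal = Encoding::UTF_8, replacement = "?")
  str.scrub(replacement).encode(internal, undef: :replace, replace: replacement)
end

# Hypothetical helper: escape characters missing from the target encoding
# as numeric character references instead of dropping them.
def to_internal_escaped(str, internal)
  str.encode(internal, fallback: ->(char) { format("&#x%X;", char.ord) })
end

to_internal_lossy("caf\xE9".b)                        # high byte becomes "?"
to_internal_escaped("日本", Encoding::ISO_8859_1)      # kanji become &#x...; references
```

Whether dropping or escaping is acceptable depends on where the string
comes from, so this would have to be a per-source decision.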

The idea is based on:

- Jeremy Kemper's strong attitude toward avoiding solutions
requiring UTF-8 as default or forcing it

- Yehuda's opinion about using UTF-8 as default in Ruby instead of
ASCII-8BIT

- James Edward Gray's solution for encoding issues in CSV

- the multitude of ways to set the encoding in Ruby

- giving everyone the liberty to use any encoding they want for any
task, without the need of porting and modifying existing code if
possible

- personal experience with many encoding pitfalls

For those interested in Ruby encoding support, I very much recommend
the extremely well written in-depth article by James Edward Gray II:

http://blog.grayproductions.net/articles/understanding_m17n

Results of "Please do investigate":
----------------------------------

The ticket:

#2188: (March 9th, 2009): Encoding error in Ruby1.9 for templates

Actual cause: ERB uses force_encoding("ASCII-8BIT") ("BINARY" is just
an alias for it). This is actually ok, except for the way Ruby 1.9
handles concat with a non-BINARY string, e.g. UTF-8:

>> '日本'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8

Although the following works (equivalent to how Ruby 1.8 works):

>> '日本'.force_encoding('BINARY').concat('語'.force_encoding('BINARY'))
=> "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"

The surprise is that it "sometimes works" when the BINARY string
contains only 7-bit ASCII characters, giving the impression that a
patch fixed the problem:

>> 'abc'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
=> "abc語"

(I used force_encoding here for consistency in different locale
settings).
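
The "sometimes works" behaviour can be checked up front with
Encoding.compatible?, which returns nil when Ruby would refuse to
combine two strings. A small sketch (concat_safe? is a hypothetical
helper name):

```ruby
# Hypothetical helper: roughly, true when a.concat(b) would not raise
# Encoding::CompatibilityError.
def concat_safe?(a, b)
  !Encoding.compatible?(a, b).nil?
end

binary_kanji = "日本".b  # non-ASCII bytes tagged as BINARY
binary_ascii = "abc".b   # 7-bit bytes tagged as BINARY
utf8_kanji   = "語"      # UTF-8 (the default source encoding)

concat_safe?(binary_kanji, utf8_kanji)  # false: both sides carry non-ASCII bytes
concat_safe?(binary_ascii, utf8_kanji)  # true: the BINARY side is 7-bit only
binary_ascii.ascii_only?                # true, which is why the concat succeeds
```

So a template that happens to contain only ASCII hides the bug until
the first non-ASCII character shows up.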

Solutions that come to mind:
----------------------------

1. force_encoding should not be used, unless really necessary, and
this rule should be applied to ERB. Unfortunately, I have no idea
why ERB uses force_encoding, but I can come up with a few reasons,
the main one being: Rails uses ERB (a general lib) for a specific
purpose and requiring a non-ASCII-8BIT encoding is just as specific.
I would really like an opinion on this.

2. Don't use ERB. AFAIK, this is why Rails 3.0 works.

3. Treat everything as binary, since the resulting file is sent to a
browser, which will detect the encoding anyway. This also doesn't
affect performance, but it ruins the whole idea of having encoding
support, possibly breaking test frameworks instead.

4. Force UTF-8. This is the brute-force idea used in many patches
and workarounds, and it is what prevents them from being committed.
People should have the right to use non-UTF-8 ERB files and render in
any encoding, e.g. EUC-JP.

5. Try to be intelligent, and guess. This means handling
everything, except BINARY. The problem is how do we know what
encoding to use for template input? And what encoding do we use for
output?

Solution 1 would be best, but force_encoding is already in the wild
with Ruby 1.9, including ruby-head, so that leaves solution 5. Option
3 is a way to get Ruby 1.9 to behave more like 1.8, but will require
all template input strings to be set to BINARY.

Solution 5
----------

force_encoding has to be used at least once somewhere in
Rails - to fix what ERB "breaks", but on what basis should the
encoding be selected? For performance, there should be no
transcoding during rendering, unless absolutely necessary.

When we think about it, the output depends on what we want the
browser to receive, and that is why many people are pushing UTF-8:
the layout usually has UTF-8 anyway, and it would otherwise have to
be parsed to get the encoding from the content-type value.

The input used in rendering a template is a mixture of what web
designers provide, translators use, databases return and
Rack emits, among other things.

The policy in Rails could be: "don't allow multiple encodings
during template rendering". I believe the effort required to do
otherwise is not justified.
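
To make the policy concrete, here is a rough sketch of an output
buffer that enforces it. SingleEncodingBuffer is a hypothetical class,
not the ActionView output buffer, and it assumes an ASCII-compatible
"internal" encoding:

```ruby
# Hypothetical buffer that accepts only strings in one encoding
# (ASCII-only strings are accepted from any ASCII-compatible encoding).
class SingleEncodingBuffer
  class MixedEncodingError < StandardError; end

  def initialize(encoding = Encoding.default_internal || Encoding.default_external)
    @encoding = encoding
    @buffer = String.new(encoding: encoding)
  end

  def <<(value)
    str = value.to_s
    unless str.encoding == @encoding || str.ascii_only?
      raise MixedEncodingError, "buffer is #{@encoding}, got #{str.encoding}"
    end
    @buffer << str
    self
  end

  def to_s
    @buffer
  end
end

buffer = SingleEncodingBuffer.new(Encoding::UTF_8)
buffer << "日本" << "abc".b   # fine: UTF-8, then 7-bit BINARY
# buffer << "語".b            # would raise MixedEncodingError
```

The point is that the error names the offending encoding at the moment
the string is appended, rather than surfacing later as a generic
CompatibilityError.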

This would force other gem developers to provide a way to set or
read the correct encoding they use or stick with the current
default. In this case (#2188), either ERB has to provide a way to
return the result in an encoding specified by Rails, or the
ERB handler should be adapted to provide this functionality.

The problem with this: ERB templates do not have an embedded
encoding. Which means we need a way to specify the encoding used in
the template.

Andrew Grim fixes this in his patch here:

https://rails.lighthouseapp.com/projects/8994/tickets/2188/a/359640/erb_encoding.diff

I am only worried about the default case, when no encoding is set.
"ASCII_8BIT", the result of ERB, is not acceptable, unless the
"internal" encoding would also be BINARY. I would propose merging the
following with the patch above:

    def compile(template)
      input = "<% __in_erb_template=true %>#{template.source}"
      src = ::ERB.new(input, nil, erb_trim_mode, '@output_buffer').src

      if RUBY_VERSION >= '1.9' and src.encoding != input.encoding
        if src.encoding == Encoding::ASCII_8BIT
          src = src.force_encoding(input.encoding) # ERB workaround
        else
          src = src.encode(input.encoding)
        end
      end

      # Ruby 1.9 prepends an encoding to the source. However this is
      # useless because you can only set an encoding on the first line
      RUBY_VERSION >= '1.9' ? src.sub(/\A#coding:.*\n/, '') : src
    end
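
The two branches above do different things, which is easy to miss. A
standalone sketch of just that step (restore_source_encoding is a
hypothetical name, and the strings stand in for ERB output):

```ruby
# Hypothetical extraction of the encoding fix-up from compile above.
def restore_source_encoding(src, input_encoding)
  return src if src.encoding == input_encoding

  if src.encoding == Encoding::ASCII_8BIT
    # BINARY carries no character information, so only the tag is changed.
    src.dup.force_encoding(input_encoding)
  else
    # A real, different encoding: the bytes themselves are transcoded.
    src.encode(input_encoding)
  end
end

compiled = "_erbout << '日本'".b                       # what ERB hands back: BINARY
fixed = restore_source_encoding(compiled, Encoding::UTF_8)
fixed.bytes == compiled.bytes                          # true, same bytes, new tag

euc = "日本".encode("EUC-JP")
restore_source_encoding(euc, Encoding::UTF_8).bytes == euc.bytes  # false, transcoded
```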

And here is an example test case, similar to many others already in
the tickets, which shows the issue:

<%= "日本" %><%= "語".force_encoding("UTF-8") %>

A few things here to note (for both patches put together):

- the fallback encoding would be assumed to be the same as ruby
default, which can be set by the locale, RUBYOPT with -K option,
or using Encoding.default_*. I believe this is sufficient
flexibility.

- note that there are no assumptions regarding the charset and the
ASCII_8BIT case is handled with this in mind

- obviously, test cases would be executed with different Ruby
encoding defaults - testing one setup no longer guarantees
anything. Rails tests should work with almost any default
encoding, which means testing with at least three should be
recommended before a patch is committed (BINARY + UTF-8 + EUC?).

- similar conversion to the "internal" encoding would be required
for all strings from other engines, databases and Rack, regardless
of whether they are in UTF-8 or not. As for Rack and strings
submitted through forms, they should ultimately also be in the
"internal" encoding and not BINARY (unless "internal" *is*
BINARY), but getting this to work is a can of worms in itself
(AFAIK, this is true for native Japanese sites, where assuming
UTF-8 is almost never valid).

- there are a few other places where ERB is used, but I prefer to
leave that until this single case is solved. Fixing other
template issues should be done separately.
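
Running the same tests under several defaults, as suggested in the
list above, could be driven by a small helper. This is only a sketch:
with_default_encodings is a hypothetical name, and the $VERBOSE
handling is there because Ruby may warn when the defaults are
reassigned at runtime.

```ruby
# Hypothetical helper: run a block with temporary Encoding defaults and
# restore the previous ones afterwards, even if the block raises.
def with_default_encodings(external, internal = nil)
  saved = [Encoding.default_external, Encoding.default_internal, $VERBOSE]
  $VERBOSE = nil # silence the "setting Encoding.default_*" warning
  Encoding.default_external = external
  Encoding.default_internal = internal
  $VERBOSE = saved[2]
  yield
ensure
  $VERBOSE = nil
  Encoding.default_external = saved[0]
  Encoding.default_internal = saved[1]
  $VERBOSE = saved[2]
end

%w[BINARY UTF-8 EUC-JP].each do |name|
  with_default_encodings(name) do
    puts "defaults now #{Encoding.default_external}; run the template tests here"
  end
end
```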

I hope this is enough to be committed into 2-3-stable, IMHO. At least
as a first step after many months of threads, discussions, issues,
tickets, articles, without any fully acceptable patches or progress.

Also, I believe the tickets in LH need some love - just to straighten
out the issue and introduce more clarity. The best results would be to
start closing the tickets with definite conclusions and guidelines, so
that people start using Ruby 1.9 with Rails, so plugin developers in
turn get enough time and feedback to get things right.

IMPORTANT: I had no intention of offending anyone by the following
digests - I just wanted to provide an overview of the lack of
progress, the complexity of the issue and the willingness to help,
despite months without progress. I admit I have no idea what prevented
the problem from being solved a long time ago.

Ticket #2188:

https://rails.lighthouseapp.com/projects/8994/tickets/2188-i18n-fails-with-multibyte-strings-in-ruby-19-similar-to-2038
1. Incorrect mention of I18n and #2038 as similar error
2. Correctly identified problem (Hector E. Gomez Morales)
3. Patch forcing UTF8 as workaround, #1988 reported as dup (Hector)
4. Unintentional hijacking with a MySQL problem (crazy_bug)
5. MySQL DB problem redirected to #2476 (Hector)
6. Unintentional hijacking with a HAML problem (Portfonica)
7. Jakub Kuźma identifies a wider set of problems
8. Jakub Kuźma identifies Rack problems
9. Adam S talks about setting default encoding in Rails
10. Jérôme points out the need for a default encoding for erb
files
11. Jeremy Kemper notes that the reports are not really helpful
12. Rocco Di Leo provides detailed test case, but formatting
problems make it unreadable
13. Adam S suggests solving the problem by converting ASCII ->
UTF8
14. hkstar mentions the lack of progress
15. Jeremy Kemper notes that the issue still hasn't been properly
investigated
16. Turns into a discussion about UTF-8 support in 1.9
17. Andrew Grim proposes alternative patch that honors ERB
template encoding
18. ahaller notes strange behaviour in ERB
19. Marcello Barnaba proposes general monkey patch for ActionView,
probably related to Rack issues
20. UVSoft proposes patch for HAML
21. Alberto describes the problem - just as Hector did
22. TICKET STATUS IS STILL OPEN WITH NO ACCEPTABLE PATCH

What I propose is combining the two patches above to close this
issue, and give references to non-related tickets which give a
similar error.

Ticket #1988: Make utf8 partial rendering from within a content_for work in ruby1.9
https://rails.lighthouseapp.com/projects/8994/tickets/1988
1. Patch that works around the issue
2. Jeremy Kemper does not accept the patch because it is UTF-8-only
3. TICKET STATUS IS INCOMPLETE

What I propose is solving #2188 first and then investigating this
bug further - it could be a bad assumption about the encoding of
strings returned by tag helpers in a specific case.

Ticket #2476: ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1
https://rails.lighthouseapp.com/projects/8994/tickets/2476
1. Hector describes a database adapter problem with 1.9 encodings,
provides a mysql-ruby fork and other links
2. Patches and fixes for databases / adapters (James Healy, Jakub
Kuźma, Yugui)
3. Talk about assuming UTF-8 for databases
4. Loren Segal proposes hack instead of modifying mysql-ruby
5. Micheal Hasensein asks about issue 5 months later
6. UVSoft accidentally posts HAML workaround
7. TICKET STATUS IS NEW

My proposal - after fixing #2188, a short description of
adapters/databases and fixed versions could be presented - and
possibly have this issue closed, to prevent it being listed as a
pending UTF-8 issue. Work could be started on validation code for
the strings returned by database adapters and their compatibility
with the "internal" encoding.
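
As a starting point for that validation code, a sketch like the
following could flag the columns of a result row that cannot enter the
"internal" encoding. The row is modelled as a plain Hash and both
method names are hypothetical; no real adapter API is used.

```ruby
# Hypothetical check: can this string be represented in the internal encoding?
def convertible_to?(value, internal)
  return false unless value.valid_encoding?

  value.encode(internal)
  true
rescue EncodingError
  false
end

# Hypothetical validator: names of the string columns that would break
# rendering under the given internal encoding.
def incompatible_columns(row, internal = Encoding::UTF_8)
  row.reject { |_name, value| !value.is_a?(String) || convertible_to?(value, internal) }.keys
end

row = {
  "id"    => 1,
  "name"  => "日本語",     # properly tagged UTF-8
  "note"  => "日本語".b,   # what #2476 describes: ASCII-8BIT from the adapter
  "title" => "abc".b       # BINARY but 7-bit only, so harmless
}
incompatible_columns(row)  # only "note" is reported
```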

Open/new tickets related to Rack:

https://rails.lighthouseapp.com/projects/8994/tickets/3331-patch-block-invalid-chars-to-come-in-rails-app

https://rails.lighthouseapp.com/projects/8994/tickets/3392-rackinput-requires-ascii-8bit-encoded-stringio

https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors

My proposal: gather issues and investigate with the help of people
working with non-utf and non-ascii input - I believe Japan is such
a country, where UTF-8 assumptions about Rack input are wrong.
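
For form input the conversion has an extra step, because the bytes
arrive tagged as BINARY and BINARY cannot be transcoded directly. A
sketch under the assumption that the charset of the submitting page is
known (decode_form_value is a hypothetical helper, not a Rack API):

```ruby
# Hypothetical helper: raw is the BINARY string from the request, declared
# is the charset the form was submitted in.
def decode_form_value(raw, declared, internal = Encoding::UTF_8)
  return raw if internal == Encoding::ASCII_8BIT # "internal" *is* BINARY

  tagged = raw.dup.force_encoding(declared)      # step 1: say what the bytes are
  unless tagged.valid_encoding?
    raise ArgumentError, "form value is not valid #{tagged.encoding}"
  end
  tagged.encode(internal)                        # step 2: transcode if needed
end

# A Shift_JIS form submission as the application would receive it.
raw = "日本語".encode("Shift_JIS").b
decode_form_value(raw, "Shift_JIS")   # UTF-8 "日本語"
# decode_form_value(raw, "UTF-8")     # would raise: wrong assumption about the input
```

The commented-out call is the UTF-8 assumption failing on non-UTF-8
input; with the validity check it fails at the boundary.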

I would like to thank everyone who invested even the slightest bit of
time in solving this issue.

I hope the information here will help find a solution that will work
without issues for years to come and that creating Rails applications
will be an enjoyable experience for users, designers, developers,
translators and all contributors, regardless of their environment and
language preferences.

--
Cezary Baginski

Jeremy Kemper

2010-Apr-19 18:30 UTC

Re: Overview of Ruby 1.9 encoding problem tickets

On Mon, Apr 19, 2010 at 6:58 AM, Czarek <cezary.baginski@gmail.com> wrote:
> SUMMARY:
> --------
>
> I tried to identify the general and root causes for these problems
> with 1.9, by taking into account non-utf encoding, current patches,
> comments and ideas. I used ticket #2188 as base for explanations.
>
> This is a long read. I wanted to include all the relevant information
> in one place. I also included information about related tickets in LH
> and their status. I decided that adding parts of this to LH would just
> add to the confusion.
>
> Two patches are included (one is from Andrew Grim) that should fix one
> issue (#2188) in a way, that fixes the problem and doesn't break
> anything. Two small steps for Rails, one giant step for proper
> encoding support. I hope.
>
> I welcome any feedback that would help get Rails closer to fully
> supporting Ruby 1.9 and vice-versa.
>
> SOLUTION:
> ---------
>
> The general idea is: allow only one "internal" encoding in Rails at
> any given time, based on the default Ruby encoding (or configurable).
>
> And treat any incoming external strings that cannot be converted to
> this "internal" encoding as errors in the gems, which they occur. And
> possibly report mismatches before they even "enter" Rails, by
> attempting to convert them into the "internal" encoding immediately.
>
> As a result of enforcing this, all Rails tests should work with any
> encoding, that is a superset of the encodings used for input (db,
> Rack, ERB, Haml, ...) in a given environment.
>
> With a optimal setup (db encoding, Ruby encoding, Rack encoding
> settings, I18n translations, ...), no transcoding will occur during
> the rendering process, no matter what the default Rails encoding is
> used (including ASCII_8BIT), and no force_encoding would be needed
> internally in Rails, except as workarounds for gems and libraries
> where this is difficult otherwise.
>
> The guideline for gem and plugin developers would be: do not create or
> return strings (other than internal use) that are not compatible with
> the default encoding both ways.
>
> In some cases, it may be acceptable to drop or escape characters that
> cannot be transcoded (maybe Rack input, for example).
+1

> The idea is based on:
>
>  - Jeremy Kemper's strong attitude toward avoiding solutions
>    requiring UTF-8 as default or forcing it
>
>  - Yehuda's opinion about using UTF-8 as default in Ruby instead of
>    ASCII-8BIT
>
>  - James Edward Gray's solution for encoding issues in CSV
>
>  - the multitude of ways to set the encoding in Ruby
>
>  - giving everyone the liberty to use any encoding they want for any
>    task, without the need of porting and modifying existing code if
>    possible
>
>  - personal experience with many encoding pitfalls
>
>
> For those interested in Ruby encoding support, I very much recommend
> the extremely well written in-depth article by James Edward Gray II:
>
>    http://blog.grayproductions.net/articles/understanding_m17n
>
>
> Results of "Please do investigate":
> ----------------------------------
>
> The ticket:
>
>  #2188: (March 9th, 2009):  Encoding error in Ruby1.9 for templates
>
> Actual cause: ERB uses force_encoding("ASCII-8BIT") which is just an
> alias for "BINARY". This is actually ok, except for the way Ruby 1.9
> handles concat with a non-BINARY string, e.g. UTF-8:
>
>  >> '日本'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
>  Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
>
> Although the following works (equivalent to how Ruby 1.8 works):
>
>  >> '日本'.force_encoding('BINARY').concat('語'.force_encoding('BINARY'))
>  => "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
>
> The surprise is that it "sometimes works", when a string contains
> only valid ASCII-7 characters, giving the impression that a patch
> fixed the problem:
>
>  >> 'abc'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
>  => "abc語"
>
> (I used force_encoding here for consistency in different locale
> settings).
>
> Solutions that come into mind:
> -----------------------------
>
>  1. force_encoding should not be used, unless really necessary, and
>  this rule should be applied to ERB. Unfortunately, I have no idea
>  why ERB uses force_encoding, but I can come up with a few reasons,
>  the main one being: Rails uses ERB (a general lib) for a specific
>  purpose and requiring a non-ASCII-8BIT encoding is just as specific.
>  I would really like an opinion on this.

I don't know why ERB forces encoding to ASCII-8BIT in the absence of a
magic comment. See r21170. The ERB compiler should probably take a
default source encoding option that's used if the magic comment is
missing.

>  2. Don't use ERB. AFAIK, this is why Rails 3.0 works.

Using Erubis is a possibility as well.

>  3. Treat everything as binary, since the resulting file is sent to a
>  browser, which will detect the encoding anyway. This is also doesn't
>  affect performance, but it ruins the whole idea of having encoding
>  support, possibly breaking test frameworks instead.

-1

>  4. Force UTF-8. This is the brute-force idea used in many patches
>  and workarounds, and this prevents commits from happening. People
>  should have a right to use non-utf8 ERB files and render in any
>  encoding e.g. EUC-JP.

-1

>  5. Try to be intelligent, and guess. This means handling
>  everything, except BINARY. The problem is how do we know what
>  encoding to use for template input? And what encoding do we use for
>  output?

We could set a single default encoding for the app, like we're doing in
Rails 3.

> Solution 1 would be best, but with force_encoding already in the wild
> with Ruby 1.9, including ruby-head.  So that leaves solution 5. Option
> 3 is a way to get Ruby 1.9 to behave more like 1.8, but will require
> all template input strings to be set to BINARY.
>
> Solution 5
> ----------
>
>  force_encoding has to be used at least once somewhere in
>  Rails - to fix what ERB "breaks", but on what basis should the
>  encoding be selected? For performance, there should be no
>  transcoding during rendering, unless absolutely necessary.
>
>  When we think about it, the output depends on what we want the
>  browser to receive, and that is why many people are pushing UTF-8:
>  the layout usually has UTF-8 anyway, and it would otherwise have to
>  be parsed to get the encoding from the content-type value.
>
>  The input using in rendering a template is a mixture of what web
>  designers provide, the translators use, the databases return and
>  Rack emits, among other things.
>
>  The policy in Rails could be: "don't allow multiple encodings
>  during template rendering". I believe the effort required to do
>  otherwise is not be justified.
>
>  This would force other gem developers to provide a way to set or
>  read the correct encoding they use or stick with the current
>  default. In this case (#2188), ERB has to either provide a way to
>  either return the result in a encoding specified by Rails, or the
>  ERB handler should be adapted to provide this functionality.
>
>  The problem with this: ERB templates do not have an embedded
>  encoding. Which means we need a way to specify the encoding used in
>  the template.
>
>  Andrew Grim fixes this in his patch here:
>
>  https://rails.lighthouseapp.com/projects/8994/tickets/2188/a/359640/erb_encoding.diff
>
> I am only worried about the default case, when no encoding is set.
> "ASCII_8BIT", the result of ERB, is not acceptable, unless the
> "internal" encoding would also be BINARY. I would propose merging the
> following with the patch above:
>
>      def compile(template)
>        input = "<% __in_erb_template=true %>#{template.source}"
>        src = ::ERB.new(input, nil, erb_trim_mode, '@output_buffer').src
>
>        if RUBY_VERSION >= '1.9' and src.encoding != input.encoding
>          if src.encoding == Encoding::ASCII_8BIT
>            src = src.force_encoding(input.encoding) #ERB workaround
>          else
>            src = src.encode(input.encoding)
>          end
>        end
>
>        # Ruby 1.9 prepends an encoding to the source. However this is
>        # useless because you can only set an encoding on the first line
>        RUBY_VERSION >= '1.9' ? src.sub(/\A#coding:.*\n/, '') : src
>      end

The ERB compiler is supposed to preserve the input file's source
encoding unless it has a magic comment. Puzzled why this is necessary.
It should also be fixed in ERB itself, I think.

>  And here is an example test case, similar to many others already in
>  the tickets, which shows the issue:
>
>    <%= "日本" %><%= "語".force_encoding("UTF-8") %>
>
> A few things here to note (for both patches put together):
>
>  - the fallback encoding would be assumed to be the same as ruby
>    default, which can be set by the locale, RUBYOPT with -K option,
>    or using Encoding.default_*. I believe this is sufficient
>    flexibility.
>
>  - note that there are no assumptions regarding the charset and the
>    ASCII_8BIT case is handled with this in mind
>
>  - obviously, test cases would be executed with different Ruby
>    encoding defaults - testing one setup no longer guarantees
>    anything. Rails tests should work with almost any default
>    encoding, which means testing at least on 3 should be recommended
>    before a patch is committed: (BINARY + UTF-8 + EUC ?).
>
>  - similar conversion to the "internal" encoding would be required
>    for all strings from other engines, databases and Rack, regardless
>    of whether they are in UTF-8 or not. As for Rack and strings
>    submitted through forms, they should ultimately be also in the
>    "internal" encoding and not BINARY (unless "internal" *is*
>    BINARY), but getting this to work is a can of worms in itself
>    (AFAIK, this is true for native Japanese sites, where assuming
>    UTF-8 is almost never valid).
>
>  - there are a few other places where ERB is used, but I prefer to
>    leave that until this single case is solved. Fixing other
>    template issues should be done separately.
>
> I hope this is enough to be committed into 2-3-stable, IMHO. At least
> as a first step after many months of threads, discussions, issues,
> tickets, articles, without any fully acceptable patches or progress.
>
> Also, I believe the tickets in LH need some love - just to straighten
> out the issue and introduce more clarity. The best results would be to
> start closing the tickets with definite conclusions and guidelines, so
> that people start using Ruby 1.9 with Rails, so plugin developers in
> turn get enough time and feedback to get things right.
>
> IMPORTANT: I had no intention of offending anyone by the following
> digests - I just wanted to provide an overview of the lack of
> progress, the complexity of issue and the willingness to help, despite
> months without progress.  I admit I have no idea what prevented the
> problem from being solved a long time ago.
>
>  Ticket #2188:
>  https://rails.lighthouseapp.com/projects/8994/tickets/2188-i18n-fails-with-multibyte-strings-in-ruby-19-similar-to-2038
>    1. Incorrect mention of I18n and #2038 as similar error
>    2. Correctly identified problem (Hector E. Gomez Morales)
>    3. Patch forcing UTF8 as workaround, #1988 reported as dup (Hector)
>    4. Unintentional hijacking with a MySQL problem (crazy_bug)
>    5. MySQL DB problem redirected to #2476 (Hector)
>    6. Unintentional hijacking with a HAML problem (Portfonica)
>    7. Jakub Kuźma identifies a wider set of problems
>    8. Jakub Kuźma identifies Rack problems
>    9. Adam S talks about setting default encoding in Rails
>    10. Jérôme points out the need for a default encoding for erb
>    files
>    11. Jeremy Kemper notes that the reports are not really helpful
>    12. Rocco Di Leo provides detailed test case, but formatting
>    problems make it unreadable
>    13. Adam S suggests solving the problem by converting ASCII ->
>    UTF8
>    14. hkstar mentions the lack of progress
>    15. Jeremy Kemper notes that the issue still hasn't been properly
>    investigated
>    16. Turns into a discussion about UTF-8 support in 1.9
>    17. Andrew Grim proposes alternative patch that honors ERB
>    template encoding
>    18. ahaller notes strange behaviour in ERB
>    19. Marcello Barnaba proposes general monkey patch for ActionView,
>    probably related to Rack issues
>    20. UVSoft proposes patch for HAML
>    21. Alberto describes the problem - just as Hector did
>    22. TICKET STATUS IS STILL OPEN WITH NO ACCEPTABLE PATCH
>
>    What I propose is combining the two patches above to close this
>    issue, and give references to non-related tickets which give a
>    similar error.

Ok, good. They'll need to be rebased against master, and I think
Andrew's patch breaks some tests since it changes the ERB line
numbers.

>  #Ticket 1988: Make utf8 partial rendering from within a content_for work in ruby1.9
>  https://rails.lighthouseapp.com/projects/8994/tickets/1988
>    1. Patch that works around the issue
>    2. Jeremy Kemper does not accept the patch due to being utf-8 - only
>    3. TICKET STATUS IS INCOMPLETE
>
>    What I propose is solving #2188 first and then investigate this
>    bug further - it could be a bad assumption about the encoding of
>    strings returned by tag helpers in a specific case.
>
>  #Ticket 2476: ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1
>  https://rails.lighthouseapp.com/projects/8994/tickets/2476
>    1. Hector describe database adaptor problem with 1.9 encodings,
>    provides a mysql-ruby fork and other links
>    2. Patches and fixes for databases / adaptors (James Healy, Jakub
>    Kuźma, Yugui)
>    3. Talk about assuming UTF-8 for databases
>    4. Loren Segal proposes hack instead of modifying mysql-ruby
>    5. Micheal Hasensein asks about issue 5 months later
>    6. UVSoft accidentally posts HAML workaround
>    6. TICKET STATUS IS NEW
>
>    My proposal - after fixing #2188, a short description of
>    adapters/databases and fixed versions could be presented - and
>    possibly have this issue closed, to prevent it being listed as a
>    pending UTF-8 issue. Work could be started on validation code for
>    the strings returned by database adapters and their compatibility
>    with the "internal" encoding.
+1

>    Open/new tickets related to Rack:
>
>  https://rails.lighthouseapp.com/projects/8994/tickets/3331-patch-block-invalid-chars-to-come-in-rails-app
>
>  https://rails.lighthouseapp.com/projects/8994/tickets/3392-rackinput-requires-ascii-8bit-encoded-stringio
>
>  https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors
>
>    My proposal: gather issues and investigate with the help of people
>    working with non-utf and non-ascii input - I believe Japan is such
>    a country, where UTF-8 assumptions about Rack input are wrong.
Rack is woefully lagging on encoding support. It needs an encoding
push of its own.

Ruby CGI has updated to include just-enough support, e.g. for giving
an encoding for parsed query parameters.
> I would like to thank everyone who invested even the slightest bit of
> time in solving this issue.
>
> I hope the information here will help find a solution that will work
> without issues for years to come and that creating Rails applications
> will be an enjoyable experience for users, designers, developers,
> translators and all contributors, regardless of their environment and
> language preferences.
Indeed! Thanks for leading the charge, Cezary.

jeremy

Jonas Nicklas

2010-Apr-19 20:28 UTC

Re: Overview of Ruby 1.9 encoding problem tickets

It's great to see someone finally take charge of this! I still don't
have the greatest grasp of character encodings, but what you're
suggesting sounds good.

Maybe one additional thing: make all generators put the magic comment
with the standard encoding at the top of all source files they create.
Does that sound like a good idea? Should we open a ticket for it?
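
Roughly what a generator would have to do, as a sketch only
(with_magic_comment is a hypothetical helper, not part of the Rails
generators):

```ruby
# Matches an existing magic comment on line 1, or on line 2 after a shebang.
MAGIC_COMMENT_RE = /\A(?:\#!.*\n)?\#.*coding\s*[:=]\s*[\w.\-]+/

# Hypothetical helper: prepend "# encoding: ..." unless the source already
# declares one; a shebang line, if present, stays first.
def with_magic_comment(source, encoding = Encoding.default_external)
  return source if source =~ MAGIC_COMMENT_RE

  header = "# encoding: #{encoding.name.downcase}\n"
  if source.start_with?("#!")
    shebang, rest = source.split("\n", 2)
    "#{shebang}\n#{header}#{rest}"
  else
    header + source
  end
end

puts with_magic_comment("class PostsController\nend\n", Encoding::UTF_8)
```

Calling it twice leaves the file unchanged, so it would be safe to run
over files that were generated earlier.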

Just to clarify how important this issue is: Rails 2.3 claims to be
Ruby 1.9 compatible, but until this is fixed, even the most trivial of
applications simply don't work on 1.9, especially if the application
is in a language that often uses non-ASCII characters (pretty much
anything other than English, in other words). This has prevented me
from moving to Ruby 1.9.

/Jonas


Czarek

2010-Apr-25 02:32 UTC

Re: Overview of Ruby 1.9 encoding problem tickets

Here are some updates from the period between when I started working
on LH #2188 and the patch I submitted there. Although the patch
specifically fixes ERB using workarounds in the Rails ERB handler, I
tried to make the approach as generic as possible.

On Mon, Apr 19, 2010 at 11:30:16AM -0700, Jeremy Kemper wrote:
> On Mon, Apr 19, 2010 at 6:58 AM, Czarek <cezary.baginski@gmail.com> wrote:
> > The general idea is: allow only one "internal" encoding in Rails at
> > any given time, based on the default Ruby encoding (or configurable).

I chose Encoding::default_external for this.

The short story is that Encoding::default_internal shouldn't really
matter for Rails.

> > As a result of enforcing this, all Rails tests should work with any
> > encoding

Probably the most convenient way to test this is:

  RUBYOPT=-Ke rake test

See #4466 for an example test script for ActionPack and the trivial
fixes that make everything work.

> > The guideline for gem and plugin developers would be: do not create or
> > return strings (other than internal use) that are not compatible with
> > the default encoding both ways.
> >
> > In some cases, it may be acceptable to drop or escape characters that
> > cannot be transcoded (maybe Rack input, for example).
>
> +1

String#encode and String#encode! both have nice options for replacing
characters and provide almost all the necessary functionality
(force_encoding handles a few other surprise cases). Rack input and
conversion between incompatible encodings are places where this seems
useful.
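
As a rough illustration of the replacement options mentioned above
(a sketch only; the sample text and the ISO-8859-1 target are
assumptions picked for the example, not something taken from Rack or
Rails):

```ruby
# encoding: utf-8
# Sketch: handling characters that cannot be transcoded.
text = "日本語 café"

# Replace each character that has no ISO-8859-1 equivalent with "?".
replaced = text.encode("ISO-8859-1", :undef => :replace, :replace => "?")

# Or drop such characters entirely.
dropped = text.encode("ISO-8859-1", :undef => :replace, :replace => "")

# Without options, the same conversion raises instead of guessing.
begin
  text.encode("ISO-8859-1")
  strict_error = nil
rescue Encoding::UndefinedConversionError => e
  strict_error = e.class
end

puts replaced.encode("UTF-8")  # ??? café
puts dropped.encode("UTF-8")   # " café" (leading space kept)
```

Whether dropping or replacing is acceptable depends on where the
string comes from; for form input it may be safer to raise and report.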

> I don't know why ERB forces encoding to ASCII-8BIT in the absence of a
> magic comment. See r21170. The ERB compiler should probably take a
> default source encoding option that's used if the magic comment is
> missing.

Two issues are worth mentioning: regexes have their own
encoding semantics and force_encoding is actually necessary if you
want to "encode" a string to or from ascii-8bit specifically.
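
The second point can be seen in a few lines. This is a minimal sketch
with a made-up one-character string: encode tries to transcode and
fails for non-ASCII text, while force_encoding only relabels the
bytes.

```ruby
# encoding: utf-8
utf8 = "語"

# encode tries to *transcode*, and ASCII-8BIT has no character mapping
# for anything outside 7-bit ASCII, so this raises.
begin
  utf8.encode(Encoding::ASCII_8BIT)
  transcode_error = nil
rescue Encoding::UndefinedConversionError => e
  transcode_error = e.class
end

# force_encoding only relabels the existing bytes; nothing is converted.
binary   = utf8.dup.force_encoding(Encoding::ASCII_8BIT)
restored = binary.dup.force_encoding(Encoding::UTF_8)

p binary.bytes == utf8.bytes   # true
p restored == utf8             # true
```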

ERB uses a regex to detect the encoding comment, but the regex has to
have the same encoding as the source stream, so ERB uses ASCII-8BIT to
be able to run the regex on the stream, regardless of the stream''s
encoding. 

Then ERB continues to use that ASCII-8BIT string for compiling, which
seems to be ok, because the strings are passed to eval with an
encoding comment at the beginning...

The problem actually lies elsewhere: ERB didn't detect the encoding,
because the encoding magic wasn't in the first tag. The first tag was
added by the Rails ERB handler:

  "<% __in_erb_template=true %><%# encoding ...."

Andrew Grim worked this out and created a patch for this in #2188.
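
To make the "first tag" problem concrete, here is a simplified model.
It is not ERB's real detection code: MAGIC_COMMENT,
detect_template_encoding and inject_tag are hypothetical names, and
the ASCII-8BIT fallback is an assumption mirroring the behaviour
described above.

```ruby
# Simplified model: only a comment tag at the very start of the
# template counts as a magic comment.
MAGIC_COMMENT = /\A<%#\s*(?:en)?coding[:=]\s*([\w.-]+)\s*%>/

def detect_template_encoding(source, fallback = Encoding::ASCII_8BIT)
  match = MAGIC_COMMENT.match(source)
  match ? Encoding.find(match[1]) : fallback
end

# Insert a handler-specific tag without displacing a leading magic comment.
def inject_tag(source, tag)
  match = MAGIC_COMMENT.match(source)
  match ? match[0] + tag + match.post_match : tag + source
end

template = "<%# encoding: EUC-JP %><p>hello</p>"
tag      = "<% __in_erb_template=true %>"

plain    = detect_template_encoding(template)                   # Encoding::EUC_JP
prefixed = detect_template_encoding(tag + template)             # falls back to ASCII-8BIT
fixed    = detect_template_encoding(inject_tag(template, tag))  # Encoding::EUC_JP again
```

Keeping the magic comment first, as inject_tag does in this sketch, is
the "Rails guarantees the first tag" option discussed below.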

Should ERB search the whole stream for an encoding tag? Or should
Rails guarantee the first tag has the encoding information? I believe
the second option will save more time. Erubis is also a reason to
forget about patching ERB directly.

> Using Erubis is a possibility as well.

Patching the ERB problem taught me that although switching to Erubis
would solve many encoding issues and headaches, it may unfortunately
hide a few general design flaws that should be worked on before
Rails 3.0 or Ruby 1.9.2 become production ready.

The workarounds I used for patching ERB seem actually quite generic.
They allow one to have partials in different encodings and even have
ASCII-8BIT as the Ruby default_external without breaking anything.
And any encoding incompatibilities occur during encode! calls in the
ERB handler - close to the problem.

Something similar could be done for db adapters, because just like the
template handler being ERB instead of Erubis, people can have
old/broken libs, gems and plugins. And since Rails is becoming more
modular with 3.0, additional issues may surface, slowing down
development in the long run.

> >  3. Treat everything as binary, since the resulting file is sent to a
> >  browser, which will detect the encoding anyway. This also doesn't
> >  affect performance, but it ruins the whole idea of having encoding
> >  support, possibly breaking test frameworks instead.
>
> -1

Actually, it turns out that supporting everything as binary takes
really no more effort than supporting multiple encodings, and it is a
good way to test Rails, applications and gems. ASCII-8BIT is the most
restrictive encoding, which makes it ideal for regression
tests. Allowing an application to support ASCII-8BIT through
default_external requires more effort, but is worth it.

> >  4. Force UTF-8. This is the brute-force idea used in many patches
> >  and workarounds, and this prevents commits from happening. People
> >  should have a right to use non-utf8 ERB files and render in any
> >  encoding e.g. EUC-JP.
>
> -1

Complementary to ASCII-8BIT, UTF-8 is ideal for an 'internal' encoding
and for detecting cases where ASCII-8BIT is (mis)used. UTF-8 should
actually *be* used when there are multiple - otherwise incompatible -
encodings. Ruby 1.8 just glues anything together, but in 1.9
everything should first be encoded to something as general as UTF-8
before being encoded to ASCII-8BIT (if there is such a need). For
example, this would allow people to make ISO2022_JP web pages from
EUC-JP templates and SJIS databases - by using UTF-8 as the internal
encoding.
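
A minimal sketch of that pivot, with two short literals standing in
for the EUC-JP template and the SJIS database value (both are
assumptions for the example):

```ruby
# encoding: utf-8
template_part = "日本".encode("EUC-JP")    # stands in for an EUC-JP template
database_part = "語".encode("Shift_JIS")   # stands in for an SJIS database value

# Gluing the two together directly fails on 1.9 and later.
begin
  template_part + database_part
  direct_error = nil
rescue Encoding::CompatibilityError => e
  direct_error = e.class
end

# Bring both into one internal encoding first, then encode the result once.
internal = template_part.encode("UTF-8") + database_part.encode("UTF-8")
page     = internal.encode("ISO-2022-JP")

puts internal            # 日本語
puts page.encoding.name  # ISO-2022-JP
```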

Although choosing UTF-8 seems wrong, in this case it prevents us from
losing encoding information by converting to ASCII-8BIT.

> We could set a single default encoding for the app, like we're doing
> in Rails 3.

I admit I haven't even tried Rails 3.0. Shame on me.

A single default encoding within Rails is a must to gracefully handle
the example I gave above (with EUC, SJIS and ISO2022). Of course UTF-8
is reasonable, but there is no reason to assume UTF-8 for all cases.

> The ERB compiler is supposed to preserve the input file's source
> encoding unless it has a magic comment. Puzzled why this is necessary.
> It should also be fixed in ERB itself, I think.

Rails inserts code that breaks ERB's magic comment detection. How does
Erubis handle the issue? Does it regex the stream?

> >  - obviously, test cases would be executed with different Ruby
> >    encoding defaults - testing one setup no longer guarantees
> >    anything. Rails tests should work with almost any default
> >    encoding, which means testing at least on 3 should be recommended
> >    before a patch is committed: (BINARY + UTF-8 + EUC ?).

Actually, all 5 cases could be used in Rails tests and in apps:

  - no K option, Ks (sjis), Ke (euc-jp), Ku (utf-8), Kn
    (binary/ascii-8bit)

ActionPack is trivial to fix. Other Rails gems may require more work.
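
For quick experiments, something similar can be approximated inside
one process. The sketch below is only an assumption-laden stand-in for
the -K runs: with_default_external is a hypothetical helper, and the
DEFAULTS list loosely mirrors the cases above rather than reproducing
what each -K flag really selects.

```ruby
require "tempfile"

# Hypothetical helper: run a block with a temporary Encoding.default_external.
def with_default_external(encoding)
  saved_external = Encoding.default_external
  saved_internal = Encoding.default_internal
  Encoding.default_external = encoding
  Encoding.default_internal = nil   # keep reads from being transcoded
  yield
ensure
  Encoding.default_external = saved_external
  Encoding.default_internal = saved_internal
end

# Assumed stand-ins for the five cases listed above.
DEFAULTS = [Encoding::US_ASCII, Encoding::Shift_JIS, Encoding::EUC_JP,
            Encoding::UTF_8, Encoding::ASCII_8BIT]

file = Tempfile.new("template")
file.write("<p>hello</p>")
file.close

# The same file comes back tagged differently under each default.
seen = DEFAULTS.map do |enc|
  with_default_external(enc) { File.read(file.path).encoding }
end
file.unlink

seen.each { |enc| puts enc.name }
```

A real test run would still want separate processes, since code loaded
before the switch keeps whatever encoding it was read with.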

> Ok, good. They'll need to be rebased against master, and I think
> Andrew's patch breaks some tests since it changes the ERB line
> numbers.

I haven't noticed this. Could you provide some details? I am wondering
how I missed this.

I didn't check his patch too thoroughly, since I was busy getting a
patch for #2188 out the door.

I only checked my own patch (based on his) on ActionPack and
ActiveSupport. Currently, everything seems to work, so let me know if
I overlooked something.

> Rack is woefully lagging on encoding support. It needs an encoding
> push of its own.
>
> Ruby CGI has updated to include just-enough support, e.g. for giving
> an encoding for parsed query parameters.

I would handle Rack last or at least after Rails tests work in all the
encodings. The reason is: I learned not to underestimate encoding
problems and leaving Rack for last seems like a good choice.

> Indeed! Thanks for leading the charge, Cezary.

I'm happy to be helpful in some way.

> jeremy
-- 
Cezary Baginski

Czarek

2010-Apr-25 11:01 UTC

Re: Overview of Ruby 1.9 encoding problem tickets

On Mon, Apr 19, 2010 at 10:28:56PM +0200, Jonas Nicklas wrote:
> It's great to see someone finally take charge of this! I still don't
> have the greatest grasp of character encodings, but what you're
> suggesting sounds good.

Thanks :)

> Maybe one additional thing: make all generators put the magic comment
> with the standard encoding at the top of all source files they create.
> Does that sound like a good idea? Should we open a ticket for it?

This is a great idea, since people new to Rails are usually both new
to Ruby and users of generators. The question is how do we choose the
encoding? Consider the following:

  % LC_CTYPE=en_US ruby -e 'p IO.read("_foo.rhtml").encoding'
  #<Encoding:US-ASCII>

  % LC_CTYPE=en_US.UTF-8 ruby -e 'p IO.read("_foo.rhtml").encoding'
  #<Encoding:UTF-8>

This is important for partials. People will eventually create partials
without the encoding information, which will be rendered from
templates. I would prefer us-ascii to be used by generators instead of
Ruby's Encoding::default_external for the following reasons:

  - the user may have a non-UTF8 environment, and us-ascii will more
    likely give an error closer in the call stack to the file without
    the encoding comment

  - the user shouldn't really use non-ascii characters in partials and
    templates - i18n is the solution and will help localize the
    application when it goes global

  - this would help adopt '# encoding: us-ascii' as a no-brainer
    solution instead of '# encoding: utf-8', which usually just makes
    problems more obscure

The only upside to using UTF-8 instead is quickly fixing huge
sites with many localized pages, but generators are for new projects
anyway.
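
The effect of the magic comment itself can be sketched with eval,
which honours an encoding comment on the first line of the code it is
given. The snippet is illustrative only; the literals are made up, and
the last check deliberately makes no claim about the exact error
message.

```ruby
# encoding: utf-8
# The same ASCII-only literal, compiled under two different magic comments.
ascii_literal = eval("# encoding: us-ascii\n'abc'.encoding")
utf8_literal  = eval("# encoding: utf-8\n'abc'.encoding")

puts ascii_literal.name  # US-ASCII
puts utf8_literal.name   # UTF-8

# A non-ASCII literal under a us-ascii comment is expected to be
# rejected when the source is parsed, which points straight at the
# offending file instead of failing later during rendering.
rejected = begin
  eval("# encoding: us-ascii\n'語'")
  false
rescue SyntaxError
  true
end
puts "non-ascii literal rejected at parse time: #{rejected}"
```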

So, by all means, yes, please open a ticket, since this may not be too
trivial and encoding issues will more likely need good understanding
rather than assuming Rails can and will magically fix everything.

> Just to clarify how important this issue is: Rails 2.3 claims to be
> Ruby 1.9 compatible, but until this is fixed, even the most trivial of
> applications simply don't work on 1.9, especially if the application
> is in a language that often uses non-ASCII characters (pretty much
> anything other than English, in other words). This has prevented me
> from moving to Ruby 1.9.

The m17n support in Ruby >= 1.9 is a great concept. Unfortunately,
balancing:
  - correctness
  - performance
  - robustness in a production environment
quickly turns encoding problems into philosophical debates. Without a
deep understanding of encoding internals it is too easy to "fix" things
by just converting to UTF-8, hiding the real issues.

Thanks for bringing this up!

> /Jonas
-- 
Cezary Baginski

michael.hasenstein@googlemail.com

2010-Apr-25 19:25 UTC

Re: Overview of Ruby 1.9 encoding problem tickets

I disagree. There are lots of apps written for just one specific
country without any intention of going global. Besides, one can have
locale-specific view files, can't we? Having "to i18n" each and every
string is a little bit too much. Of course, the folks in the US won't
notice, you guys are well off while the rest of the world suffers from
such a policy...

On Apr 25, 1:01 pm, Czarek <cezary.bagin...@gmail.com> wrote:
> ...
>   - user shouldn't really use non ascii characters in partials and
>     templates - i18n is the solution and will help localize the
>     application when it goes global
> ...


Czarek

2010-Apr-25 21:28 UTC

Re: Re: Overview of Ruby 1.9 encoding problem tickets

On Sun, Apr 25, 2010 at 12:25:44PM -0700, michael.hasenstein@googlemail.com wrote:
> I disagree. There are lots of apps written for just one specific
> country without any intention of going global. Besides, one can have
> locale-specific view files, can't we? Having "to i18n" each and every
> string is a little bit too much. Of course, the folks in the US won't
> notice, you guys are well off while the rest of the world suffers from
> such a policy...

Forgive me for not making the context clear. There is no 'policy'
here, just a suggested generator default behavior for users writing
mainly US applications, possibly wishing to easily globalize their
applications in the future. In *this* case specifically, my
conclusions are:

  - using utf-8 instead of us-ascii for encoding comments hides
    problems for those users

  - people with no experience in encodings other than us-ascii will
    forget the encoding comments more often than not

  - Ruby 1.9 chokes when trying to combine two strings in different
    encodings when both contain non us-ascii characters

  - generators could create files with us-ascii by default to prevent
    the above
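
The third point can be checked with Encoding.compatible?, which
returns the encoding a combination would end up in, or nil when the
two strings cannot be combined. A small sketch with made-up strings:

```ruby
# encoding: utf-8
utf8   = "für"
latin1 = "für".encode("ISO-8859-1")
ascii  = "abc".encode("US-ASCII")

# Two strings with non-ASCII content in different encodings: no common ground.
both_non_ascii = Encoding.compatible?(utf8, latin1)

# A pure us-ascii string combines with either of them.
ascii_latin1 = Encoding.compatible?(ascii, latin1)
ascii_utf8   = Encoding.compatible?(ascii, utf8)

p both_non_ascii            # nil
puts ascii_latin1.name      # ISO-8859-1
puts ascii_utf8.name        # UTF-8
puts (ascii + utf8).encoding.name  # UTF-8
```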

If that case does not describe your own, chances are you already know
what you are doing and Rails gives you all the freedom you can get to
adapt things to your own situation, choosing the right tool for the
right job.

The reason for the proposed generator default is *exactly* to help
people unaware of encoding problems to deliver applications that spare
others the suffering and grief.

> On Apr 25, 1:01 pm, Czarek <cezary.bagin...@gmail.com> wrote:
> ...
> >   - user shouldn't really use non ascii characters in partials and
> >     templates - i18n is the solution and will help localize the
> >     application when it goes global
> ...
-- 
Cezary Baginski

Paul Sponagl

2010-Apr-26 08:30 UTC

Re: Overview of Ruby 1.9 encoding problem tickets

> - user shouldn't really use non ascii characters in partials and
>   templates - i18n is the solution and will help localize the
>   application when it goes global

-1

if you know that a rails app will run only within one country within a
controllable group (e.g. intranet apps) it does not make much sense adding the
overhead of separate language files.

>> Just to clarify how important this issue is: Rails 2.3 claims to be
>> Ruby 1.9 compatible, but until this is fixed, even the most trivial of
>> applications simply don't work on 1.9, especially if the application
>> is in a language that often uses non-ASCII characters (pretty much
>> anything other than English, in other words). This has prevented me
>> from moving to Ruby 1.9.
>
> The m17n support in Ruby > 1.9 is a great concept. Unfortunately
> balancing:
> - correctness
> - performance
> - robustness in a production environment
> quickly turns encoding problems into philosophical debates. Without a
> deep understanding of encoding internal it is too easy to "fix" things
> by just converting to UTF-8, hiding the real issues.

well - i "upgraded" our site running in germany to ruby 1.9.1, unicorn
and rails 2.3.6.
even with utf-8 as a default i had to make various patches within rack to
get it up and running.

rack: utils

   # Unescapes a URI escaped string. (Stolen from Camping).
   def unescape(s)
     result = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
       [$1.delete('%')].pack('H*')
     }
     # on 1.9, label the unescaped bytes as UTF-8 (no transcoding)
     RUBY_VERSION >= "1.9" ? result.force_encoding(Encoding::UTF_8) : result
   end
   module_function :unescape

found at lighthouse...




the next one is horrible - i know, but it works for now:

   def parse_query(qs, d = nil)
     params = {}

     (qs || '').split(d ? /[#{d}] */n : DEFAULT_SEP).each do |p|
       k, v = p.split('=', 2).map { |x| unescape(x) }
       begin
         if v =~ /^("|')(.*)\1$/
           v = $2.gsub('\\'+$1, $1)
         end
       rescue
         # matching failed: treat the bytes as ISO-8859-1 and transcode
         v.force_encoding('ISO-8859-1')
         v.encode!('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
         if v =~ /^("|')(.*)\1$/
           v = $2.gsub('\\'+$1, $1)
         end
       end

(we use analytics at the site - analytics stores the last search query within a
cookie. If a user browses google and finds the site with an umlaut query,
this query will be stored within the cookie. parse_query is used by rack to
parse cookies too. guess what - it will go boom if you use utf-8 as a default
and get an incoming cookie with a different encoding...)
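
The same idea can be written without relying on rescue. This is a
sketch under assumptions, not a Rack patch: bytes_to_utf8 is a
hypothetical helper, and ISO-8859-1 as the fallback is simply the
guess used in the hack above.

```ruby
# encoding: utf-8
# Hypothetical helper, not part of Rack: label raw bytes as UTF-8 when
# they are valid UTF-8, otherwise assume a fallback encoding and transcode.
def bytes_to_utf8(bytes, fallback = Encoding::ISO_8859_1)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  candidate.force_encoding(fallback).encode(
    Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => ""
  )
end

# The same word as raw bytes, once sent as UTF-8 and once as ISO-8859-1.
utf8_cookie   = "m\xC3\xBCnchen".dup.force_encoding(Encoding::ASCII_8BIT)
latin1_cookie = "m\xFCnchen".dup.force_encoding(Encoding::ASCII_8BIT)

puts bytes_to_utf8(utf8_cookie)    # münchen
puts bytes_to_utf8(latin1_cookie)  # münchen
```

Checking valid_encoding? first means well-formed UTF-8 input is never
re-interpreted; only byte sequences that cannot be UTF-8 take the
fallback path.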



the next ugly thing :)

   def normalize_params(params, name, v = nil)
     if v and v =~ /^("|')(.*)\1$/
       v = $2.gsub('\\'+$1, $1)
     end
     name =~ %r(\A[\[\]]*([^\[\]]+)\]*)
     k = $1 || ''
     after = $' || ''

     return if k.empty?

     if after == ""
       params[k] = (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
       # params[k] = v
     elsif after == "[]"
       params[k] ||= []
       raise TypeError, "expected Array (got #{params[k].class.name}) for param `#{k}'" unless params[k].is_a?(Array)
       params[k] << (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
       # params[k] << v
     elsif after =~ %r(^\[\]\[([^\[\]]+)\]$) || after =~ %r(^\[\](.+)$)


all patches i found did not include the multipart solution ... this hack makes
sure that multipart variables will be utf-8 forced too ...


Yes / i am glad and thank you that you made this overdue summary!
i hope others will have a better start into the ruby1.9 rails 2.3 world than i had.
In fact there were times i really wondered why someone dares to state that
rails is 1.9 compatible for a real world (not real US) app!

Thanks a lot!





Czarek

2010-Apr-26 10:55 UTC


Re: Overview of Ruby 1.9 encoding problem tickets

On Mon, Apr 26, 2010 at 10:30:16AM +0200, Paul Sponagl wrote:
> 
> > - user shouldn't really use non ascii characters in partials and
> >   templates - i18n is the solution and will help localize the
> >   application when it goes global
> 
> -1
> 
> if you know that a rails app will run only within one country within
> a controllable group (e.g. intranet apps) it does not make much
> sense adding the overhead of separate language files.

I didn't state correctly what I meant, and thank you for helping me
realize that :) 

What I did mean was that users shouldn't assume non-ascii characters
will always work correctly with Ruby 1.9 without specifying encoding
comments or ensuring specific, correct environment settings. So, let
me rephrase:

  Users should not be able to use non-ascii characters in a us-ascii
  environment without providing an alternative encoding comment or
  overriding the environment settings. If neither of these is
  acceptable, i18n is a suggestion.
  
This behavior would be consistent with the way Ruby loads source
files. The reason is that doing otherwise can cause obscure,
hard-to-track encoding problems that look like Rails bugs.

By supplying a _default_ "us-ascii" encoding comment in generated
template files, we help people oblivious to encoding details to do the
right thing or do the necessary research (i18n, change encoding
comments, localized versions of pages, etc).

Encoding problems can be so frustrating that it is easy to perceive US
developers as being ignorant. The truth is, it is unusual for them to
even experience the problems or reproduce them without effort, let alone
research ways to test the issues effectively. This feature may
slightly help with the latter.

Suggestion
----------

I am wondering if Rails could actually assume us-ascii for all types
of template files without a specified encoding, or emit a warning 
unless Ruby is running in full UTF-8 mode (-Ku) or full binary (-Kn). 

The fix would be adding encoding comments to all the files and may
mean a lot of work for existing projects. On the other hand, this is
more consistent with how Ruby 1.9 handles source files, so it won't be
a surprise to anyone. 

This would prevent people from forgetting to put encoding comments in
partials, for example. And if this would really be troublesome, people
can always stick to Ruby 1.8 or run their servers with -Ku or even
-Kn. 
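
As a rough illustration of the suggested check (not an existing Rails feature), the sketch below treats a template without an encoding comment as us-ascii and reports any non-ASCII bytes. The helper name `template_encoding_warning` is hypothetical, and the comment pattern is only an approximation of Ruby's magic comment rule; the `<%# encoding: ... %>` form in the usage note is likewise an assumption.

```ruby
# Hypothetical helper: returns a warning message, or nil if the template
# source is acceptable under an "assume us-ascii" policy.
def template_encoding_warning(source)
  bytes = source.b                       # work on raw bytes only
  first_line = bytes.lines.first.to_s    # "" for an empty template

  # An explicit encoding comment on the first line settles the question.
  return nil if first_line =~ /coding[:=]\s*[\w.-]+/

  # Pure us-ascii content is safe under any ASCII-compatible encoding.
  return nil if bytes.ascii_only?

  "non-ASCII bytes found but no encoding comment in the template"
end
```

With this policy, `"<p>hello</p>"` passes, a template containing umlauts is reported, and the same template passes again once its first line reads `<%# encoding: utf-8 %>`.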

Would anyone care to comment on this idea?

And thank you too for helping out! Especially for giving the summary
of rack issues with patches, which obviously saved me hours of
research. 

It may be a while before Rails 2.3 becomes 1.9 compatible as the
result of detailed test cases and well thought out, politically
correct patches, but it is so encouraging to see Rails users not
giving up!

Thanks again!

-- 
Cezary Baginski

Paul Sponagl

2010-Apr-26 11:52 UTC


Re: Overview of Ruby 1.9 encoding problem tickets

I am very busy writing new features for our project, so I do not have the
time/brainspace/brainpower right now to think about clean solutions, but
here is just one more thing that might help you too:

In my application_controller I added a very, very, very bad (I know, do not
blame me, it is working :) charset test for routes with special characters,
as I wanted to use paths with umlauts too. If a browser or search bot
defaults to ISO, requesting www.domain.com/über will obviously break things
when using "über".force_encoding('utf-8') within Rails...


  # An ISO-8859-1 / MacRoman umlaut byte that is not preceded by \xc3,
  # the lead byte of the same characters in UTF-8.
  REGEXP_ISO = Regexp.new('[^\xc3][\xe4\xf6\xfc\xc4\xd6\xdc\xdf]', nil, 'n')
  REGEXP_MACROMAN = Regexp.new('[^\xc3][\x8a\x9a\x9f\x80\x85\x86\xa7]', nil, 'n')

  def check_params_encoding( key )
    unless params[key].blank?
      # Match on raw bytes, then transcode if a legacy encoding is detected.
      params[key].force_encoding('ASCII-8BIT')
      if params[key].match(REGEXP_ISO)
        params[key].force_encoding('ISO-8859-1')
        params[key].encode!('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
      elsif params[key].match(REGEXP_MACROMAN)
        params[key].force_encoding('macRoman')
        params[key].encode!('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
      end
      # Kept inside the guard so that a missing param (nil) is left alone.
      params[key].force_encoding(Encoding::UTF_8)
    end
  end
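
To show what the byte pattern above does and does not catch, here is a small sketch. It is written as a regexp literal with the `/n` flag instead of the three-argument `Regexp.new`, and `looks_like_latin1?` is a hypothetical name used only for this illustration.

```ruby
# An ISO-8859-1 umlaut byte preceded by any byte other than \xc3.
ISO_UMLAUT = /[^\xc3][\xe4\xf6\xfc\xc4\xd6\xdc\xdf]/n

# Hypothetical helper: true if the raw bytes match the ISO heuristic.
def looks_like_latin1?(str)
  # String#b returns a binary copy, so the byte-oriented regexp can be
  # applied without an encoding compatibility error.
  !!(str.b =~ ISO_UMLAUT)
end

latin1_word  = "f\xFCr".b    # "für" encoded as ISO-8859-1
utf8_word    = "f\u00FCr"    # "für" encoded as UTF-8 (\xC3\xBC)
leading_case = "\xFCber".b   # "über" as ISO-8859-1, umlaut at position 0
```

Because the pattern requires one byte in front of the umlaut, a value that begins with an umlaut is not detected. A path such as `/über` still matches thanks to the leading slash, but a bare parameter value like `über` would not.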

By the way, I switched in one step from

ruby 1.8.7 => ruby 1.9
backgroundrb => delayed_job
ferret => sphinx
thin => unicorn
rails 2.3.2 => 2.3.6
(memcached as frontend cache)

and I have to say (after blood, sweat, tears and exceptions on the production
servers, which led to those quick hacks ;)

IT ROCKS :) !

No more aaf ferret issues, fast searches, slim job workers, painless fast
restarts, no more (un)fair balancing!

Good luck with your Rails projects and all the best on your 1.9 journey!

Paul 


Paul Sponagl

2010-Apr-27 11:41 UTC


Re: Overview of Ruby 1.9 encoding problem tickets

Yesterday I found another situation where I got hit by the 1.9 encoding
problems.

Believe it or not, I have seen a case on our site where IE8 sends
ISO-encoded URIs after receiving the page, including the link, in UTF-8.
I thought this one was safe, but it is not!
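
The two variants of such a request can be compared with the standard library alone. This sketch uses `URI.decode_www_form_component` rather than Rack's `unescape`, and the example paths are made up for illustration.

```ruby
require 'uri'

# The same link, percent-encoded by a client using UTF-8 and by one
# using ISO-8859-1.
utf8_path   = URI.decode_www_form_component("%C3%BCber")
latin1_path = URI.decode_www_form_component("%FCber")

# Both results are labeled UTF-8, but only the first one is valid UTF-8.
utf8_ok   = utf8_path.valid_encoding?
latin1_ok = latin1_path.valid_encoding?

# Decoding with the encoding the client really used, then transcoding,
# recovers the intended text.
recovered = URI.decode_www_form_component("%FCber", Encoding::ISO_8859_1)
               .encode(Encoding::UTF_8)
```

The catch is that the server cannot know from the request which encoding the client used, which is why the hacks in this thread have to guess.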

Now I decided to add a very simple module, force_recoding, in lib
(find it below; yes, it could or should be a native, rchardet-like kernel
method) and to patch Rack utils and the Rails request.

By the way, rchardet did not work, even in the 1.9 version at
http://github.com/speedmax/rchardet.

For now I would say: do not use Rails with 1.9 outside the US unless you
enjoy debugging on production servers, and make sure that
exception_notification works! This last error prevented it from sending
mails, because ERB went crazy while spitting the ISO string into a UTF-8
context... I was informed by users... smells like 1995...

And please, Rails core, write down the encoding problems under "Improved
compatibility with Ruby 1.9" at
http://weblog.rubyonrails.org/2009/11/30/ruby-on-rails-2-3-5-released and help
newcomers get on the right trail to Rails!

I have been working with Rails for about 3 years now, so I can say I have at
least a bit of experience. A newcomer will never use Rails again when facing
this kind of hard-to-track-down error (IMO only segfaults could be worse!).



i patched rack/utils.rb:
---------------8< ------------

 # -*- encoding: binary -*-
 
 require 'set'
 require 'tempfile'
+require 'force_recoding'
 
 module Rack
   # Rack::Utils contains a grab-bag of useful methods for writing web
   # applications adopted from all kinds of Ruby libraries.
 
   module Utils
+
+    include ForceRecoding
+    module_function :force_recoding
+
     # Performs URI escaping so that you can construct proper
     # query strings faster. Use this rather than the cgi.rb
     # version since it's faster. (Stolen from Camping).
@@ -21,9 +25,10 @@
 
     # Unescapes a URI escaped string. (Stolen from Camping).
     def unescape(s)
-      s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
+      result = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
         [$1.delete('%')].pack('H*')
       }
+      result = force_recoding( result )
     end
     module_function :unescape
 
@@ -32,16 +37,23 @@
     # Stolen from Mongrel, with some small modifications:
     # Parses a query string by breaking it up at the '&'
     # and ';' characters. You can also use this to parse
     # cookies by changing the characters used in the second
     # parameter (which defaults to '&;').
     def parse_query(qs, d = nil)
       params = {}
 
       (qs || '').split(d ? /[#{d}] */n : DEFAULT_SEP).each do |p|
         k, v = p.split('=', 2).map { |x| unescape(x) }
+        begin
+          if v =~ /^("|')(.*)\1$/
+            v = $2.gsub('\\'+$1, $1)
+          end
+        rescue
+          v = force_recoding( v )
         if v =~ /^("|')(.*)\1$/
           v = $2.gsub('\\'+$1, $1)
         end
+        end
         if cur = params[k]
           if cur.class == Array
             params[k] << v
@@ -79,12 +91,15 @@
 
       return if k.empty?
 
+
       if after == ""
-        params[k] = v
+        params[k] = force_recoding( v )
+        # params[k] = v
       elsif after == "[]"
         params[k] ||= []
         raise TypeError, "expected Array (got #{params[k].class.name}) for param `#{k}'" unless params[k].is_a?(Array)
-        params[k] << v
+        params[k] << force_recoding( v )
+        # params[k] << v
       elsif after =~ %r(^\[\]\[([^\[\]]+)\]$) || after =~ %r(^\[\](.+)$)
         child_key = $1
         params[k] ||= []


and now within rails action_controller request.rb:
---------------8< ------------

    include ForceRecoding
    # Returns the query string, accounting for server idiosyncrasies.
    def query_string
      @env['QUERY_STRING'].present? ? force_recoding(@env['QUERY_STRING']) : (force_recoding(@env['REQUEST_URI']).split('?', 2)[1] || '')
    end

    # Returns the request URI, accounting for server idiosyncrasies.
    # WEBrick includes the full URL. IIS leaves REQUEST_URI blank.
    def request_uri
      if uri = force_recoding(@env['REQUEST_URI'])
        # Remove domain, which webrick puts into the request_uri.
        (%r{^\w+\://[^/]+(/.*|$)$} =~ uri) ? $1 : uri
      else
        # Construct IIS missing REQUEST_URI from SCRIPT_NAME and PATH_INFO.
        uri = force_recoding(@env['PATH_INFO']).to_s

        if script_filename = @env['SCRIPT_NAME'].to_s.match(%r{[^/]+$})
          uri = uri.sub(/#{script_filename}\//, '')
        end

        env_qs = force_recoding(@env['QUERY_STRING']).to_s
        uri += "?#{env_qs}" unless env_qs.empty?

        if uri.blank?
          @env.delete('REQUEST_URI')
        else
          @env['REQUEST_URI'] = uri
        end
      end
    end

here is the module:
---------------8< ------------
module ForceRecoding

  REGEXP_ISO = Regexp.new('[^\xc3][\xe4\xf6\xfc\xc4\xd6\xdc\xdf]', nil, 'n')
  REGEXP_MACROMAN = Regexp.new('[^\xc3][\x8a\x9a\x9f\x80\x85\x86\xa7]', nil, 'n')

  def force_recoding( str )
    return str if RUBY_VERSION < "1.9" || str.nil? || !str.is_a?(String)
    unless str.blank?
      str.force_encoding('ASCII-8BIT')
      if str.match(REGEXP_ISO)
        str.force_encoding('ISO-8859-1')
        str.encode!('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
      elsif str.match(REGEXP_MACROMAN)
        str.force_encoding('macRoman')
        str.encode!('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
      end
    end
    str.force_encoding(Encoding::UTF_8)
    str
  end

end
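
For comparison, a different and simpler strategy than the byte patterns in the module above is to accept the input as UTF-8 when it validates and only otherwise fall back to a single legacy encoding. The sketch below is not part of the patch; `to_utf8` is a hypothetical name, and the ISO-8859-1 default is an assumption that would have to be chosen per site.

```ruby
# Hypothetical helper: return a UTF-8 copy of str, guessing a legacy
# encoding only when the bytes are not valid UTF-8.
def to_utf8(str, fallback = Encoding::ISO_8859_1)
  return str unless str.is_a?(String)

  # Work on copies so the caller's string is not relabeled in place.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8: reinterpret the bytes in the fallback encoding
  # and transcode, dropping anything that cannot be represented.
  str.dup.force_encoding(fallback)
     .encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')
end
```

Unlike the regexp approach, this handles a value that starts with an umlaut, but it cannot tell ISO-8859-1 from MacRoman, so it trades detection for predictability.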


