SUMMARY:
--------

I tried to identify the general and root causes of these problems
with Ruby 1.9, taking into account non-UTF encodings, current patches,
comments and ideas. I used ticket #2188 as the basis for explanations.

This is a long read. I wanted to include all the relevant information
in one place. I also included information about related tickets in LH
(Lighthouse) and their status. I decided that adding parts of this to
LH would just add to the confusion.

Two patches are included (one is from Andrew Grim) that should fix one
issue (#2188) without breaking anything else. Two small steps for
Rails, one giant step for proper encoding support. I hope.

I welcome any feedback that would help get Rails closer to fully
supporting Ruby 1.9 and vice-versa.

SOLUTION:
---------

The general idea is: allow only one "internal" encoding in Rails at
any given time, based on the default Ruby encoding (or configurable).

And treat any incoming external strings that cannot be converted to
this "internal" encoding as errors in the gems in which they occur.
And possibly report mismatches before they even "enter" Rails, by
attempting to convert them into the "internal" encoding immediately.

As a result of enforcing this, all Rails tests should work with any
encoding that is a superset of the encodings used for input (db,
Rack, ERB, Haml, ...) in a given environment.

With an optimal setup (db encoding, Ruby encoding, Rack encoding
settings, I18n translations, ...), no transcoding will occur during
the rendering process, no matter which default Rails encoding is used
(including ASCII_8BIT), and no force_encoding would be needed
internally in Rails, except as workarounds for gems and libraries
where this is difficult otherwise.

The guideline for gem and plugin developers would be: do not create or
return strings (other than for internal use) that are not compatible
with the default encoding both ways.

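The "convert on entry" idea above could look roughly like the sketch
below. The helper name to_internal, the INTERNAL_ENCODING constant and
the source labels are all hypothetical (nothing like this exists in
Rails 2.3); the sketch only shows one way a string might be checked
and converted before it reaches the rest of the framework.

```ruby
# Hypothetical helper, not part of Rails: normalise an incoming string to a
# single "internal" encoding at the boundary, and fail loudly on a mismatch.
INTERNAL_ENCODING = Encoding.default_internal || Encoding.default_external

def to_internal(str, source = "unknown", internal = INTERNAL_ENCODING)
  # Already in the internal encoding and well-formed: nothing to do.
  return str if str.encoding == internal && str.valid_encoding?

  # BINARY carries no character information, so transcoding is meaningless.
  # Relabel the bytes and accept them only if they are valid in the target.
  if internal == Encoding::ASCII_8BIT || str.encoding == Encoding::ASCII_8BIT
    relabelled = str.dup.force_encoding(internal)
    return relabelled if relabelled.valid_encoding?
    raise EncodingError, "#{source}: bytes are not valid #{internal.name}"
  end

  unless str.valid_encoding?
    raise EncodingError, "#{source}: invalid #{str.encoding.name} bytes"
  end

  # String#encode raises Encoding::UndefinedConversionError (an EncodingError)
  # when a character has no equivalent in the target encoding.
  str.encode(internal)
end

# Example: a string from an EUC-JP database, with UTF-8 as internal encoding.
from_db = "日本語".encode("EUC-JP")
in_rails = to_internal(from_db, "db adapter", Encoding::UTF_8)
```

Raising at the boundary, instead of deep inside rendering, is what
would let the error be attributed to the gem that produced the string.
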
In some cases, it may be acceptable to drop or escape characters that
cannot be transcoded (maybe Rack input, for example).

The idea is based on:

- Jeremy Kemper's strong attitude toward avoiding solutions
  requiring UTF-8 as default or forcing it

- Yehuda's opinion about using UTF-8 as default in Ruby instead of
  ASCII-8BIT

- James Edward Gray's solution for encoding issues in CSV

- the multitude of ways to set the encoding in Ruby

- giving everyone the liberty to use any encoding they want for any
  task, without the need to port and modify existing code if
  possible

- personal experience with many encoding pitfalls


For those interested in Ruby encoding support, I very much recommend
the extremely well written in-depth article by James Edward Gray II:

http://blog.grayproductions.net/articles/understanding_m17n


Results of "Please do investigate":
-----------------------------------

The ticket:

#2188: (March 9th, 2009): Encoding error in Ruby1.9 for templates

Actual cause: ERB uses force_encoding("ASCII-8BIT") ("BINARY" is just
an alias for it). This is actually ok, except for the way Ruby 1.9
handles concat with a non-BINARY string, e.g. UTF-8:

    >> '日本'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
    Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8

Although the following works (equivalent to how Ruby 1.8 works):

    >> '日本'.force_encoding('BINARY').concat('語'.force_encoding('BINARY'))
    => "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"

The surprise is that it "sometimes works", namely when the receiving
string contains only 7-bit ASCII characters, giving the impression
that a patch fixed the problem:

    >> 'abc'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
    => "abc語"

(I used force_encoding here for consistency in different locale
settings).

Solutions that come to mind:
----------------------------

1. force_encoding should not be used, unless really necessary, and
   this rule should be applied to ERB. Unfortunately, I have no idea
   why ERB uses force_encoding, but I can come up with a few reasons,
   the main one being: Rails uses ERB (a general lib) for a specific
   purpose and requiring a non-ASCII-8BIT encoding is just as specific.
   I would really like an opinion on this.

2. Don't use ERB. AFAIK, this is why Rails 3.0 works.

3. Treat everything as binary, since the resulting file is sent to a
   browser, which will detect the encoding anyway. This also doesn't
   affect performance, but it ruins the whole idea of having encoding
   support, possibly breaking test frameworks instead.

4. Force UTF-8. This is the brute-force idea used in many patches
   and workarounds, and it is what prevents commits from happening.
   People should have the right to use non-UTF-8 ERB files and render
   in any encoding, e.g. EUC-JP.

5. Try to be intelligent, and guess. This means handling
   everything, except BINARY. The problem is: how do we know what
   encoding to use for template input? And what encoding do we use for
   output?

Solution 1 would be best, but force_encoding is already in the wild
with Ruby 1.9, including ruby-head, so that leaves solution 5. Option
3 is a way to get Ruby 1.9 to behave more like 1.8, but will require
all template input strings to be set to BINARY.

Solution 5
----------

force_encoding has to be used at least once somewhere in Rails - to
fix what ERB "breaks" - but on what basis should the encoding be
selected? For performance, there should be no transcoding during
rendering, unless absolutely necessary.

When we think about it, the output depends on what we want the
browser to receive, and that is why many people are pushing UTF-8:
the layout usually has UTF-8 anyway, and it would otherwise have to
be parsed to get the encoding from the content-type value.

The input used in rendering a template is a mixture of what web
designers provide, the translators use, the databases return and
Rack emits, among other things.

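To make the "mixture of inputs" problem concrete, here is a rough
sketch that flags rendering inputs whose encoding differs from a
chosen internal encoding. The helper name encoding_mismatches and the
fragment labels ("erb", "db", "rack") are made up for illustration and
are not an existing Rails API; UTF-8 as the internal encoding is just
an assumption for the example.

```ruby
# Hypothetical helper (not a Rails API): list the rendering inputs whose
# encoding differs from the single "internal" encoding.
def encoding_mismatches(fragments, internal = Encoding::UTF_8)
  fragments.each_with_object({}) do |(source, str), report|
    next if str.encoding == internal

    report[source] = {
      encoding: str.encoding.name,
      # ASCII-only strings concatenate without error, which is how a
      # mismatch stays hidden until non-ASCII data shows up.
      hidden_by_ascii: str.ascii_only?
    }
  end
end

fragments = {
  "erb"  => "abc".dup.force_encoding("BINARY"),  # compiled template source
  "db"   => "語",                                 # already UTF-8
  "rack" => "日本".dup.force_encoding("BINARY")   # raw form input
}

report = encoding_mismatches(fragments)
report.each do |source, info|
  note = info[:hidden_by_ascii] ? "hidden while ASCII-only" : "raises when concatenated with non-ASCII UTF-8"
  puts "#{source}: #{info[:encoding]} (#{note})"
end
```

Comparing encodings strictly, instead of waiting for concat to raise,
is what catches the ASCII-only case that "sometimes works".
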
The policy in Rails could be: "don't allow multiple encodings during
template rendering". I believe the effort required to do otherwise
is not justified.

This would force other gem developers to provide a way to set or
read the correct encoding they use, or to stick with the current
default. In this case (#2188), either ERB has to provide a way to
return the result in an encoding specified by Rails, or the ERB
handler should be adapted to provide this functionality.

The problem with this: ERB templates do not have an embedded
encoding, which means we need a way to specify the encoding used in
the template.

Andrew Grim fixes this in his patch here:

https://rails.lighthouseapp.com/projects/8994/tickets/2188/a/359640/erb_encoding.diff

I am only worried about the default case, when no encoding is set.
"ASCII_8BIT", the result of ERB, is not acceptable, unless the
"internal" encoding is also BINARY. I would propose merging the
following with the patch above:

    def compile(template)
      input = "<% __in_erb_template=true %>#{template.source}"
      src = ::ERB.new(input, nil, erb_trim_mode, '@output_buffer').src

      if RUBY_VERSION >= '1.9' and src.encoding != input.encoding
        if src.encoding == Encoding::ASCII_8BIT
          src = src.force_encoding(input.encoding) # ERB workaround
        else
          src = src.encode(input.encoding)
        end
      end

      # Ruby 1.9 prepends an encoding to the source. However this is
      # useless because you can only set an encoding on the first line
      RUBY_VERSION >= '1.9' ? src.sub(/\A#coding:.*\n/, '') : src
    end

And here is an example test case, similar to many others already in
the tickets, which shows the issue:

    <%= "日本" %><%= "語".force_encoding("UTF-8") %>

A few things to note here (for both patches put together):

- the fallback encoding would be assumed to be the same as the Ruby
  default, which can be set by the locale, RUBYOPT with the -K
  option, or using Encoding.default_*. I believe this is sufficient
  flexibility.

- note that there are no assumptions regarding the charset and the
  ASCII_8BIT case is handled with this in mind

- obviously, test cases would be executed with different Ruby
  encoding defaults - testing one setup no longer guarantees
  anything. Rails tests should work with almost any default
  encoding, which means testing with at least three (BINARY + UTF-8
  + EUC?) should be recommended before a patch is committed.

- similar conversion to the "internal" encoding would be required
  for all strings from other engines, databases and Rack, regardless
  of whether they are in UTF-8 or not. As for Rack and strings
  submitted through forms, they should ultimately also be in the
  "internal" encoding and not BINARY (unless "internal" *is*
  BINARY), but getting this to work is a can of worms in itself
  (AFAIK, this is true for native Japanese sites, where assuming
  UTF-8 is almost never valid).

- there are a few other places where ERB is used, but I prefer to
  leave that until this single case is solved. Fixing other
  template issues should be done separately.

I hope this is enough to be committed into 2-3-stable, IMHO. At least
as a first step after many months of threads, discussions, issues,
tickets and articles, without any fully acceptable patches or
progress.

Also, I believe the tickets in LH need some love - just to straighten
out the issue and introduce more clarity. The best result would be to
start closing the tickets with definite conclusions and guidelines, so
that people start using Ruby 1.9 with Rails, and plugin developers in
turn get enough time and feedback to get things right.

IMPORTANT: I had no intention of offending anyone by the following
digests - I just wanted to provide an overview of the lack of
progress, the complexity of the issue and the willingness to help,
despite months without progress. I admit I have no idea what prevented
the problem from being solved a long time ago.

Ticket #2188:
https://rails.lighthouseapp.com/projects/8994/tickets/2188-i18n-fails-with-multibyte-strings-in-ruby-19-similar-to-2038

  1. Incorrect mention of I18n and #2038 as similar error
  2. Correctly identified problem (Hector E. Gomez Morales)
  3. Patch forcing UTF8 as workaround, #1988 reported as dup (Hector)
  4. Unintentional hijacking with a MySQL problem (crazy_bug)
  5. MySQL DB problem redirected to #2476 (Hector)
  6. Unintentional hijacking with a HAML problem (Portfonica)
  7. Jakub Kuźma identifies a wider set of problems
  8. Jakub Kuźma identifies Rack problems
  9. Adam S talks about setting default encoding in Rails
  10. Jérôme points out the need for a default encoding for erb
      files
  11. Jeremy Kemper notes that the reports are not really helpful
  12. Rocco Di Leo provides detailed test case, but formatting
      problems make it unreadable
  13. Adam S suggests solving the problem by converting ASCII ->
      UTF8
  14. hkstar mentions the lack of progress
  15. Jeremy Kemper notes that the issue still hasn't been properly
      investigated
  16. Turns into a discussion about UTF-8 support in 1.9
  17. Andrew Grim proposes alternative patch that honors ERB
      template encoding
  18. ahaller notes strange behaviour in ERB
  19. Marcello Barnaba proposes general monkey patch for ActionView,
      probably related to Rack issues
  20. UVSoft proposes patch for HAML
  21. Alberto describes the problem - just as Hector did
  22. TICKET STATUS IS STILL OPEN WITH NO ACCEPTABLE PATCH

  What I propose is combining the two patches above to close this
  issue, and giving references to the unrelated tickets which give a
  similar error.

Ticket #1988: Make utf8 partial rendering from within a content_for work in ruby1.9
https://rails.lighthouseapp.com/projects/8994/tickets/1988

  1. Patch that works around the issue
  2. Jeremy Kemper does not accept the patch because it is UTF-8-only
  3. TICKET STATUS IS INCOMPLETE

  What I propose is solving #2188 first and then investigating this
  bug further - it could be a bad assumption about the encoding of
  strings returned by tag helpers in a specific case.

Ticket #2476: ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1
https://rails.lighthouseapp.com/projects/8994/tickets/2476

  1. Hector describes database adapter problem with 1.9 encodings,
     provides a mysql-ruby fork and other links
  2. Patches and fixes for databases / adapters (James Healy, Jakub
     Kuźma, Yugui)
  3. Talk about assuming UTF-8 for databases
  4. Loren Segal proposes hack instead of modifying mysql-ruby
  5. Micheal Hasensein asks about issue 5 months later
  6. UVSoft accidentally posts HAML workaround
  7. TICKET STATUS IS NEW

  My proposal - after fixing #2188, a short description of
  adapters/databases and fixed versions could be presented - and
  possibly have this issue closed, to prevent it being listed as a
  pending UTF-8 issue. Work could be started on validation code for
  the strings returned by database adapters and their compatibility
  with the "internal" encoding.

Open/new tickets related to Rack:

https://rails.lighthouseapp.com/projects/8994/tickets/3331-patch-block-invalid-chars-to-come-in-rails-app
https://rails.lighthouseapp.com/projects/8994/tickets/3392-rackinput-requires-ascii-8bit-encoded-stringio
https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors

  My proposal: gather issues and investigate with the help of people
  working with non-UTF and non-ASCII input - I believe Japan is such
  a country, where UTF-8 assumptions about Rack input are wrong.

I would like to thank everyone who invested even the slightest bit of
time in solving this issue.

I hope the information here will help find a solution that will work
without issues for years to come and that creating Rails applications
will be an enjoyable experience for users, designers, developers,
translators and all contributors, regardless of their environment and
language preferences.

--
Cezary Baginski

On Mon, Apr 19, 2010 at 6:58 AM, Czarek <cezary.baginski@gmail.com> wrote:
> [...]
> SOLUTION:
> ---------
>
> The general idea is: allow only one "internal" encoding in Rails at
> any given time, based on the default Ruby encoding (or configurable).
>
> And treat any incoming external strings that cannot be converted to
> this "internal" encoding as errors in the gems in which they occur.
> And possibly report mismatches before they even "enter" Rails, by
> attempting to convert them into the "internal" encoding immediately.
>
> As a result of enforcing this, all Rails tests should work with any
> encoding that is a superset of the encodings used for input (db,
> Rack, ERB, Haml, ...) in a given environment.
>
> With an optimal setup (db encoding, Ruby encoding, Rack encoding
> settings, I18n translations, ...), no transcoding will occur during
> the rendering process, no matter which default Rails encoding is used
> (including ASCII_8BIT), and no force_encoding would be needed
> internally in Rails, except as workarounds for gems and libraries
> where this is difficult otherwise.
>
> The guideline for gem and plugin developers would be: do not create or
> return strings (other than for internal use) that are not compatible
> with the default encoding both ways.
>
> In some cases, it may be acceptable to drop or escape characters that
> cannot be transcoded (maybe Rack input, for example).

+1

> [...]
> Solutions that come to mind:
> ----------------------------
>
> 1. force_encoding should not be used, unless really necessary, and
>    this rule should be applied to ERB. Unfortunately, I have no idea
>    why ERB uses force_encoding, but I can come up with a few reasons,
>    the main one being: Rails uses ERB (a general lib) for a specific
>    purpose and requiring a non-ASCII-8BIT encoding is just as specific.
>    I would really like an opinion on this.

I don't know why ERB forces encoding to ASCII-8BIT in the absence of a
magic comment. See r21170. The ERB compiler should probably take a
default source encoding option that's used if the magic comment is
missing.

> 2. Don't use ERB. AFAIK, this is why Rails 3.0 works.

Using Erubis is a possibility as well.

> 3. Treat everything as binary, since the resulting file is sent to a
>    browser, which will detect the encoding anyway. This also doesn't
>    affect performance, but it ruins the whole idea of having encoding
>    support, possibly breaking test frameworks instead.

-1

> 4. Force UTF-8. This is the brute-force idea used in many patches
>    and workarounds, and it is what prevents commits from happening.
>    People should have the right to use non-UTF-8 ERB files and render
>    in any encoding, e.g. EUC-JP.

-1

> 5. Try to be intelligent, and guess. This means handling
>    everything, except BINARY. The problem is: how do we know what
>    encoding to use for template input? And what encoding do we use for
>    output?

We could set a single default encoding for the app, like we're doing
in Rails 3.

> [...]
> I am only worried about the default case, when no encoding is set.
> "ASCII_8BIT", the result of ERB, is not acceptable, unless the
> "internal" encoding is also BINARY. I would propose merging the
> following with the patch above:
>
>     def compile(template)
>       input = "<% __in_erb_template=true %>#{template.source}"
>       src = ::ERB.new(input, nil, erb_trim_mode, '@output_buffer').src
>
>       if RUBY_VERSION >= '1.9' and src.encoding != input.encoding
>         if src.encoding == Encoding::ASCII_8BIT
>           src = src.force_encoding(input.encoding) # ERB workaround
>         else
>           src = src.encode(input.encoding)
>         end
>       end
>
>       # Ruby 1.9 prepends an encoding to the source. However this is
>       # useless because you can only set an encoding on the first line
>       RUBY_VERSION >= '1.9' ? src.sub(/\A#coding:.*\n/, '') : src
>     end

The ERB compiler is supposed to preserve the input file's source
encoding unless it has a magic comment. Puzzled why this is necessary.
It should also be fixed in ERB itself, I think.

> [...]
> What I propose is combining the two patches above to close this
> issue, and giving references to the unrelated tickets which give a
> similar error.

Ok, good. They'll need to be rebased against master, and I think
Andrew's patch breaks some tests since it changes the ERB line
numbers.

> [...]
> Ticket #2476: ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1
> [...]
> My proposal - after fixing #2188, a short description of
> adapters/databases and fixed versions could be presented - and
> possibly have this issue closed, to prevent it being listed as a
> pending UTF-8 issue. Work could be started on validation code for
> the strings returned by database adapters and their compatibility
> with the "internal" encoding.

+1

> Open/new tickets related to Rack:
> [...]
> My proposal: gather issues and investigate with the help of people
> working with non-UTF and non-ASCII input - I believe Japan is such
> a country, where UTF-8 assumptions about Rack input are wrong.

Rack is woefully lagging on encoding support. It needs an encoding
push of its own. Ruby CGI has updated to include just-enough support,
e.g. for giving an encoding for parsed query parameters.

> I would like to thank everyone who invested even the slightest bit of
> time in solving this issue.
>
> I hope the information here will help find a solution that will work
> without issues for years to come and that creating Rails applications
> will be an enjoyable experience for users, designers, developers,
> translators and all contributors, regardless of their environment and
> language preferences.

Indeed! Thanks for leading the charge, Cezary.

jeremy

It's great to see someone finally take charge of this! I still don't
have the greatest grasp of character encodings, but what you're
suggesting sounds good.

Maybe one additional thing: make all generators put the magic comment
with the standard encoding at the top of all source files they create.
Does that sound like a good idea? Should we open a ticket for it?

Just to clarify how important this issue is: Rails 2.3 claims to be
Ruby 1.9 compatible, but until this is fixed, even the most trivial of
applications simply don't work on 1.9, especially if the application
is in a language that often uses non-ASCII characters (pretty much
anything other than English, in other words). This has prevented me
from moving to Ruby 1.9.

/Jonas
> > The input using in rendering a template is a mixture of what web > designers provide, the translators use, the databases return and > Rack emits, among other things. > > The policy in Rails could be: "don''t allow multiple encodings > during template rendering". I believe the effort required to do > otherwise is not be justified. > > This would force other gem developers to provide a way to set or > read the correct encoding they use or stick with the current > default. In this case (#2188), ERB has to either provide a way to > either return the result in a encoding specified by Rails, or the > ERB handler should be adapted to provide this functionality. > > The problem with this: ERB templates do not have an embedded > encoding. Which means we need a way to specify the encoding used in > the template. > > Andrew Grim fixes this in his patch here: > > https://rails.lighthouseapp.com/projects/8994/tickets/2188/a/359640/erb_encoding.diff > > I am only worried about the default case, when no encoding is set. > "ASCII_8BIT", the result of ERB, is not acceptable, unless the > "internal" encoding would also be BINARY. I would propose merging the > following with the patch above: > > def compile(template) > input = "<% __in_erb_template=true %>#{template.source}" > src = ::ERB.new(input, nil, erb_trim_mode, ''@output_buffer'').src > > if RUBY_VERSION >= ''1.9'' and src.encoding != input.encoding > if src.encoding == Encoding::ASCII_8BIT > src = src.force_encoding(input.encoding) #ERB workaround > else > src = src.encode(input.encoding) > end > end > > # Ruby 1.9 prepends an encoding to the source. However this is > # useless because you can only set an encoding on the first line > RUBY_VERSION >= ''1.9'' ? 
src.sub(/\A#coding:.*\n/, '''') : src > end > > And here is an example test case, similar to many others already in > the tickets, which shows the issue: > > <%= "日本" %><%= "語".force_encoding("UTF-8") %> > > A few things here to note (for both patches put together): > > - the fallback encoding would be assumed to be the same as ruby > default, which can be set by the locale, RUBYOPT with -K option, > or using Encoding.default_*. I believe this is sufficient > flexibility. > > - note that there are no assumptions regarding the charset and the > ASCII_8BIT case is handled with this in mind > > - obviously, test cases would be executed with different Ruby > encoding defaults - testing one setup no longer guarantees > anything. Rails tests should work with almost any default > encoding, which means testing at least on 3 should be recommended > before a patch is committed: (BINARY + UTF-8 + EUC ?). > > - similar conversion to the "internal" encoding would be required > for all strings from other engines, databases and Rack, regardless > of whether they are in UTF-8 or not. As for Rack and strings > submitted through forms, they should ultimately be also in the > "internal" encoding and not BINARY (unless "internal" *is* > BINARY), but getting this to work is a can of worms in itself > (AFAIK, this is true for native Japanese sites, where assuming > UTF-8 is almost never valid). > > - there are a few other places where ERB is used, but I prefer to > leave that until this single case is solved. Fixing other > template issues should be done separately. > > I hope this is enough to be committed into 2-3-stable, IMHO. At least > as a first step after many months of threads, discussions, issues, > tickets, articles, without any fully acceptable patches or progress. > > Also, I believe the tickets in LH need some love - just to straighten > out the issue and introduce more clarity. 
The best results would be to > start closing the tickets with definite conclusions and guidelines, so > that people start using Ruby 1.9 with Rails, so plugin developers in > turn get enough time and feedback to get things right. > > IMPORTANT: I had intention of offending anyone by the following > digests - I just wanted to provide an overview of the lack of > progress, the complexity of issue and the willingness to help, despite > months without progress. I admit I have no idea what prevented the > problem from being solved a long time ago. > > Ticket #2188: > https://rails.lighthouseapp.com/projects/8994/tickets/2188-i18n-fails-with-multibyte-strings-in-ruby-19-similar-to-2038 > 1. Incorrect mention of I18n and #2038 as similar error > 2. Correctly identified problem (Hector E. Gomez Morales) > 3. Patch forcing UTF8 as workaround, #1988 reported as dup (Hector) > 4. Unintentional hijacking with a MySQL problem (crazy_bug) > 5. MySQL DB problem redirected to #2476 (Hector) > 6. Unintentional hijacking with a HAML problem (Portfonica) > 7. Jakub Kuźma identifies a wider set of problems > 8. Jakub Kuźma identifies Rack problems > 9. Adam S talks about setting default encoding in Rails > 10. Jérôme points out the need for a default encoding for erb > files > 11. Jeremy Kemper notes that the reports are not really helpful > 12. Rocco Di Leo provides detailed test case, but formatting > problems make it unreadable > 13. Adam S suggests solving the problem by converting ASCII -> > UTF8 > 14. hkstar mentions the lack of progress > 15. Jeremy Kemper notes that the issue still hasn''t been properly > investigated > 16. Turns into a discussion about UTF-8 support in 1.9 > 17. Andrew Grim proposes alternative patch that honors ERB > template encoding > 18. ahaller notes strange behaviour in ERB > 19. Marcello Barnaba proposes general monkey patch for ActionView, > probably related to Rack issues > 20. UVSoft proposes patch for HAML > 21. 
Alberto describes the problem - just as Hector did > 22. TICKET STATUS IS STILL OPEN WITH NO ACCEPTABLE PATCH > > What I propose is combining the two patches above to close this > issue, and give references to non-related tickets which give a > similar error. > > > #Ticket 1988: Make utf8 partial rendering from within a content_for work in ruby1.9 > https://rails.lighthouseapp.com/projects/8994/tickets/1988 > 1. Patch that works around the issue > 2. Jeremy Kemper does not accept the patch due to being utf-8 - only > 3. TICKET STATUS IS INCOMPLETE > > What I propose is solving #2188 first and then investigate this > bug further - it could be a bad assumption about the encoding of > strings returned by tag helpers in a specific case. > > #Ticket 2476: ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1 > https://rails.lighthouseapp.com/projects/8994/tickets/2476 > 1. Hector describe database adaptor problem with 1.9 encodings, > provides a mysql-ruby fork and other links > 2. Patches and fixes for databases / adaptors (James Healy, Jakub > Kuźma, Yugui) > 3. Talk about assuming UTF-8 for databases > 4. Loren Segal proposes hack instead of modifying mysql-ruby > 5. Micheal Hasensein asks about issue 5 months later > 6. UVSoft accidentally posts HAML workaround > 6. TICKET STATUS IS NEW > > My proposal - after fixing #2188, a short description of > adapters/databases and fixed versions could be presented - and > possibly have this issue closed, to prevent it being listed as a > pending UTF-8 issue. Work could be started on validation code for > the strings returned by database adapters and their compatibility > with the "internal" encoding. 
> > > Open/new tickets related to Rack: > > https://rails.lighthouseapp.com/projects/8994/tickets/3331-patch-block-invalid-chars-to-come-in-rails-app > https://rails.lighthouseapp.com/projects/8994/tickets/3392-rackinput-requires-ascii-8bit-encoded-stringio > https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors > > My proposal: gather issues and investigate with the help of people > working with non-utf and non-ascii input - I believe Japan is such > a country, where UTF-8 assumptions about Rack input are wrong. > > I would like to thank everyone who invested even the slightest bit of > time in solving this issue. > > I hope the information here will help find a solution that will work > without issues for years to come and that creating Rails applications > will be an enjoyable experience for users, designers, developers, > translators and all contributors, regardless of their environment and > language preferences. > > -- > Cezary Baginski > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.9 (GNU/Linux) > > iEYEARECAAYFAkvMYYMACgkQgEYXSknSpI/llgCfavXgCMfl5ueJPUrwptSil092 > eTEAoK7viEHYiHnmrS5rHXPwmpCAYV8c > =CHR3 > -----END PGP SIGNATURE----- > >-- You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com. To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com. For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en.
Here are some updates covering the time from when I started working on
LH #2188 up to the patch I submitted there. Although the patch
specifically fixes ERB using workarounds in the Rails ERB handler, I
tried to make the approach as generic as possible.

On Mon, Apr 19, 2010 at 11:30:16AM -0700, Jeremy Kemper wrote:
> On Mon, Apr 19, 2010 at 6:58 AM, Czarek <cezary.baginski@gmail.com> wrote:
> > The general idea is: allow only one "internal" encoding in Rails at
> > any given time, based on the default Ruby encoding (or configurable).

I chose Encoding::default_external for this. The short story is that
Encoding::default_internal shouldn't really matter for Rails.

> > As a result of enforcing this, all Rails tests should work with any
> > encoding

Probably the most convenient way to test this is:

  RUBYOPT=-Ke rake test

See #4466 for an example test script for ActionPack and the trivial
fixes that make everything work.

> > The guideline for gem and plugin developers would be: do not create or
> > return strings (other than internal use) that are not compatible with
> > the default encoding both ways.
> >
> > In some cases, it may be acceptable to drop or escape characters that
> > cannot be transcoded (maybe Rack input, for example).
>
> +1

String#encode and String#encode! both have nice options for replacing
characters and provide almost all the necessary functionality
(force_encoding handles a few other surprise cases). Rack input and
conversion between incompatible encodings are places where this seems
useful.

> I don't know why ERB forces encoding to ASCII-8BIT in the absence of a
> magic comment. See r21170. The ERB compiler should probably take a
> default source encoding option that's used if the magic comment is
> missing.

Two issues are worth mentioning: regexes have their own encoding
semantics, and force_encoding is actually necessary if you want to
"encode" a string to or from ASCII-8BIT specifically.

ERB uses a regex to detect the encoding comment, but the regex has to
be compatible with the encoding of the source stream, so ERB uses
ASCII-8BIT to be able to run the regex on the stream, regardless of
the stream's encoding.

Then ERB continues to use that ASCII-8BIT string for compiling, which
seems to be ok, because the strings are passed to eval with an
encoding comment at the beginning...

The problem actually lies elsewhere: ERB didn't detect the encoding,
because the encoding magic wasn't in the first tag. The first tag was
added by the Rails ERB handler:

  "<% __in_erb_template=true %><%# encoding ...."

Andrew Grim worked this out and created a patch for this in #2188.

Should ERB search the whole stream for an encoding tag? Or should
Rails guarantee the first tag has the encoding information? I believe
the second option will save more time. Erubis is also a reason to
forget about patching ERB directly.

> Using Erubis is a possibility as well.

Patching the ERB problem taught me that although switching to Erubis
would solve many encoding issues and headaches, it may unfortunately
hide a few general design flaws that should be worked on before Rails
3.0 or Ruby 1.9.2 become production ready.

The workarounds I used for patching ERB actually seem quite generic.
They allow one to have partials in different encodings and even to
have ASCII-8BIT as the Ruby default_external without breaking
anything. And any encoding incompatibilities surface during encode!
calls in the ERB handler - close to the problem.

Something similar could be done for db adapters, because just like
the template handler being ERB instead of Erubis, people can have
old/broken libs, gems and plugins. And since Rails is becoming more
modular with 3.0, additional issues may surface, slowing down
development in the long run.

> > 3. Treat everything as binary, since the resulting file is sent to a
> > browser, which will detect the encoding anyway. This is also doesn't
> > affect performance, but it ruins the whole idea of having encoding
> > support, possibly breaking test frameworks instead.
>
> -1

Actually, it turns out that supporting everything as binary takes
really no more effort than supporting multiple encodings, and it is a
good way to test Rails, applications and gems. ASCII-8BIT is the most
restrictive encoding, which makes it ideal for regression tests.
Allowing an application to support ASCII-8BIT through
default_external requires more effort, but is worth it.

> > 4. Force UTF-8. This is the brute-force idea used in many patches
> > and workarounds, and this prevents commits from happening. People
> > should have a right to use non-utf8 ERB files and render in any
> > encoding e.g. EUC-JP.
>
> -1

Complementary to ASCII-8BIT, UTF-8 is ideal as an 'internal' encoding
and for detecting cases where ASCII-8BIT is (mis)used.

UTF-8 should actually *be* used when there are multiple - otherwise
incompatible - encodings. Ruby 1.8 just glues anything together, but
in 1.9 everything should first be encoded to something as general as
UTF-8 before being encoded to ASCII-8BIT (if there is such a need).

For example, this would allow people to make ISO2022_JP web pages
from EUC-JP templates and SJIS databases - by using UTF-8 as the
internal encoding.

Although choosing UTF-8 seems wrong, in this case it prevents us from
losing encoding information by converting to ASCII-8BIT.

> We could set a single default encoding for the app, like we're doing
> in Rails 3.

I admit I haven't even tried Rails 3.0. Shame on me.

A single default encoding within Rails is a must to gracefully handle
the example I gave above (with EUC, SJIS and ISO2022). Of course
UTF-8 is reasonable, but there is no reason to assume UTF-8 for all
cases.

> The ERB compiler is supposed to preserve the input file's source
> encoding unless it has a magic comment. Puzzled why this is necessary.
> It should also be fixed in ERB itself, I think.

Rails inserts code that breaks ERB's magic comment detection. How
does Erubis handle the issue? Does it regex the stream?

> > - obviously, test cases would be executed with different Ruby
> > encoding defaults - testing one setup no longer guarantees
> > anything. Rails tests should work with almost any default
> > encoding, which means testing at least on 3 should be recommended
> > before a patch is committed: (BINARY + UTF-8 + EUC ?).

Actually, all 5 cases could be used in Rails tests and in apps:

  - no K option, Ks (sjis), Ke (euc-jp), Ku (utf-8), Kn (binary/ascii-8bit)

ActionPack is trivial to fix. Other Rails gems may require more work.

> Ok, good. They'll need to be rebased against master, and I think
> Andrew's patch breaks some tests since it changes the ERB line
> numbers.

I haven't noticed this. Could you provide some details? I am
wondering how I missed this.

I didn't check his patch too thoroughly, since I was busy getting a
patch for #2188 out the door. I only checked my own patch (based on
his) on ActionPack and ActiveSupport. Currently, everything seems to
work, so let me know if I overlooked something.

> Rack is woefully lagging on encoding support. It needs an encoding
> push of its own.
>
> Ruby CGI has updated to include just-enough support, e.g. for giving
> an encoding for parsed query parameters.

I would handle Rack last, or at least after Rails tests work in all
the encodings. The reason is that I have learned not to underestimate
encoding problems, and leaving Rack for last seems like a good
choice.

> Indeed! Thanks for leading the charge, Cezary.

I'm happy to be helpful in some way.

>
> jeremy

--
Cezary Baginski
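The approach described in this reply (relabel with force_encoding when ASCII-8BIT is involved, transcode with encode otherwise, and let failures surface at the boundary) could be sketched roughly as follows. This is only an illustration and not the code from the #2188 patch: `to_internal` and its `internal` parameter are hypothetical names, and defaulting to Encoding.default_external simply follows the choice mentioned above.

```ruby
# Hypothetical helper: bring an incoming string into one "internal" encoding.
def to_internal(str, internal = Encoding.default_external)
  return str if str.encoding == internal

  if str.encoding == Encoding::ASCII_8BIT || internal == Encoding::ASCII_8BIT
    # Binary carries no character semantics, so there is nothing to
    # transcode: only the label changes, the bytes stay untouched.
    str.dup.force_encoding(internal)
  else
    # Real transcoding. An incompatible string raises here
    # (e.g. Encoding::UndefinedConversionError), close to its source.
    str.encode(internal)
  end
end

euc = "\u65E5\u672C\u8A9E".encode(Encoding::EUC_JP)
to_internal(euc, Encoding::UTF_8)      # transcoded EUC-JP -> UTF-8
to_internal("abc".b, Encoding::UTF_8)  # relabeled only

# Where dropping characters is acceptable (possibly some Rack input),
# String#encode can substitute instead of raising:
"caf\u00E9 \u65E5\u672C".encode(Encoding::ISO_8859_1, undef: :replace, replace: "?")
```

Note that the force_encoding branch can produce a string that is not valid in the target encoding; a caller that cares could check `valid_encoding?` afterwards.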
On Mon, Apr 19, 2010 at 10:28:56PM +0200, Jonas Nicklas wrote:
> It's great to see someone finally take charge of this! I still don't
> have the greatest grasp of character encodings, but what you're
> suggesting sounds good.

Thanks :)

> Maybe one additional thing: make all generators put the magic comment
> with the standard encoding at the top of all source files they create.
> Does that sound like a good idea? Should we open a ticket for it?

This is a great idea, since people new to Rails are usually both new
to Ruby and users of generators.

The question is how do we choose the encoding? Consider the following:

  % LC_CTYPE=en_US ruby -e 'p IO.read("_foo.rhtml").encoding'
  #<Encoding:US-ASCII>

  % LC_CTYPE=en_US.UTF-8 ruby -e 'p IO.read("_foo.rhtml").encoding'
  #<Encoding:UTF-8>

This is important for partials. People will eventually create
partials without the encoding information, which will be rendered
from templates.

I would prefer us-ascii to be used by generators instead of Ruby's
Encoding::default_external for the following reasons:

  - the user may have a non-UTF8 environment, and us-ascii will more
    likely give an error closer in the call stack to the file without
    the encoding comment

  - the user shouldn't really use non ascii characters in partials and
    templates - i18n is the solution and will help localize the
    application when it goes global

  - this would help adopt '# encoding: us-ascii' as a no-brainer
    solution instead of '# encoding: utf-8', which usually just makes
    problems more obscure

The only upside to using UTF-8 instead is quickly fixing huge sites
with many localized pages, but generators are for new projects
anyway.

So, by all means, yes, please open a ticket, since this may not be
trivial, and encoding issues are more likely to need a good
understanding than an assumption that Rails can and will magically
fix everything.

> Just to clarify how important this issue is: Rails 2.3 claims to be
> Ruby 1.9 compatible, but until this is fixed, even the most trivial of
> applications simply don't work on 1.9, especially if the application
> is in a language that often uses non-ASCII characters (pretty much
> anything other than English, in other words). This has prevented me
> from moving to Ruby 1.9.

The m17n support in Ruby > 1.9 is a great concept. Unfortunately
balancing:

  - correctness
  - performance
  - robustness in a production environment

quickly turns encoding problems into philosophical debates. Without a
deep understanding of encoding internals it is too easy to "fix"
things by just converting to UTF-8, hiding the real issues.

Thanks for bringing this up!

>
> /Jonas
>

--
Cezary Baginski
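As a rough illustration of the generator idea discussed above, a generator could pass its output through a small helper that prepends the magic comment when none is present. This is a sketch under assumptions: `with_encoding_comment` is a hypothetical name and not part of the Rails generators, the us-ascii default only mirrors the preference stated in this message, and only the first line is inspected (a shebang line is not handled).

```ruby
# Hypothetical generator helper: make sure generated Ruby source starts
# with an encoding magic comment.
def with_encoding_comment(source, encoding_name = "us-ascii")
  # Leave the file alone if its first line already declares an encoding,
  # e.g. "# encoding: utf-8" or "# -*- coding: euc-jp -*-".
  return source if source =~ /\A#.*coding\s*[:=]\s*[\w.-]+/

  "# encoding: #{encoding_name}\n#{source}"
end

generated = with_encoding_comment("class PostsController < ApplicationController\nend\n")
puts generated
```

A file that already carries a comment such as `# encoding: utf-8` is returned unchanged, so the helper can be applied to every generated file without overriding a deliberate choice.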
michael.hasenstein@googlemail.com
2010-Apr-25 19:25 UTC
Re: Overview of Ruby 1.9 encoding problem tickets
I disagree. There are lots of apps written for just one specific
country without any intention of going global. Besides, one can have
locale-specific view files, can't one? Having "to i18n" each and every
string is a little bit too much. Of course, the folks in the US won't
notice, you guys are well off while the rest of the world suffers from
such a policy...

On Apr 25, 1:01 pm, Czarek <cezary.bagin...@gmail.com> wrote:
....
> - user shouldn't really use non ascii characters in partials and
>   templates - i18n is the solution and will help localize the
>   application when it goes global
...
On Sun, Apr 25, 2010 at 12:25:44PM -0700, michael.hasenstein@googlemail.com wrote:
> I disagree. There are lots of apps written for just one specific
> country without any intention of going global. Besides, one can have
> locale-specific view files, can't we? Having "to i18n" each and every
> string is a little bit too much. Of course, the folks in the US won't
> notice, you guys are well off while the rest of the world suffers from
> such a policy...

Forgive me for not making the context clear. There is no 'policy'
here, just a suggested generator default behavior for users writing
mainly US applications, possibly wishing to easily globalize their
applications in the future.

In *this* case specifically, my conclusions are:

  - using utf-8 instead of us-ascii for encoding comments hides
    problems for those users

  - people with no experience in encodings other than us-ascii will
    forget the encoding comments more often than not

  - Ruby 1.9 chokes when trying to concatenate two incompatible
    strings that are not us-ascii

  - generators could create files with us-ascii by default to prevent
    the above

If that case does not describe your own, chances are you already know
what you are doing, and Rails gives you all the freedom you can get to
adapt things to your own situation, choosing the right tool for the
right job.

The reason for the proposed generator default is *exactly* to help
people unaware of encoding problems to deliver applications that
spare others the suffering and grief.

>
> On Apr 25, 1:01 pm, Czarek <cezary.bagin...@gmail.com> wrote:
> ....
> > - user shouldn't really use non ascii characters in partials and
> >   templates - i18n is the solution and will help localize the
> >   application when it goes global
> ...

--
Cezary Baginski
> - user shouldn't really use non ascii characters in partials and
>   templates - i18n is the solution and will help localize the
>   application when it goes global

-1

if you know that a rails app will run only within one country within
a controllable group (e.g. intranet apps) it does not make much sense
adding the overhead of separate language files.

>> Just to clarify how important this issue is: Rails 2.3 claims to be
>> Ruby 1.9 compatible, but until this is fixed, even the most trivial of
>> applications simply don't work on 1.9, especially if the application
>> is in a language that often uses non-ASCII characters (pretty much
>> anything other than English, in other words). This has prevented me
>> from moving to Ruby 1.9.
>
> The m17n support in Ruby > 1.9 is a great concept. Unfortunately
> balancing:
> - correctness
> - performance
> - robustness in a production environment
> quickly turns encoding problems into philosophical debates. Without a
> deep understanding of encoding internal it is too easy to "fix" things
> by just converting to UTF-8, hiding the real issues.

well - i "upgraded" our site running in germany to ruby1.9.1, unicorn
and rails 2.3.6. even with using utf-8 as a default i had to make
various patches within rack to get it up and running.

rack: utils

    # Unescapes a URI escaped string. (Stolen from Camping).
    def unescape(s)
      result = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
        [$1.delete('%')].pack('H*')
      }
      RUBY_VERSION >= "1.9" ? result.force_encoding(Encoding::UTF_8) : result
    end
    module_function :unescape

found at lighthouse...

the next one is horrible - i know, but it works for now:

    def parse_query(qs, d = nil)
      params = {}

      (qs || '').split(d ? /[#{d}] */n : DEFAULT_SEP).each do |p|
        k, v = p.split('=', 2).map { |x| unescape(x) }
        begin
          if v =~ /^("|')(.*)\1$/
            v = $2.gsub('\\'+$1, $1)
          end
        rescue
          v.force_encoding('ISO-8859-1')
          v.encode!('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
          if v =~ /^("|')(.*)\1$/
            v = $2.gsub('\\'+$1, $1)
          end
        end

(we use analytics at the site - analytics stores the last search
query within a cookie. If a user browses google and finds the site
with an umlaut query, this query will be stored within the cookie.
parse_query is used by rack to parse cookies too. guess what - it
will go boom if you use utf-8 as a default and get an incoming cookie
with a different encoding.)

the next ugly thing :)

    def normalize_params(params, name, v = nil)
      if v and v =~ /^("|')(.*)\1$/
        v = $2.gsub('\\'+$1, $1)
      end
      name =~ %r(\A[\[\]]*([^\[\]]+)\]*)
      k = $1 || ''
      after = $' || ''

      return if k.empty?

      if after == ""
        params[k] = (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
        # params[k] = v
      elsif after == "[]"
        params[k] ||= []
        raise TypeError, "expected Array (got #{params[k].class.name}) for param `#{k}'" unless params[k].is_a?(Array)
        params[k] << (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
        # params[k] << v
      elsif after =~ %r(^\[\]\[([^\[\]]+)\]$) || after =~ %r(^\[\](.+)$)

all patches i found did not include the multipart solution ... this
hack makes sure that multipart variables will be utf-8 forced too ...

Yes / i am glad and thank you that you made this overdue summary!

i hope others will have a better start into the ruby1.9 rails 2.3
world than i had. In fact there were times i really wondered why
someone dares to state that rails is 1.9 compatible for a real world
(not real US) app!

Thanks a lot!
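The fallback in the parse_query hack above (treat the value as UTF-8, and if that fails reinterpret it as ISO-8859-1 and transcode) can also be written without using an exception as the trigger. A minimal sketch, assuming the raw bytes are already unescaped; `normalize_param` is a hypothetical name, not a Rack method, and ISO-8859-1 as the fallback is taken from the hack above rather than from anything Rack guarantees.

```ruby
# Hypothetical helper: label incoming bytes as UTF-8 when they are valid
# UTF-8, otherwise reinterpret them in a fallback encoding and transcode.
def normalize_param(raw, fallback = Encoding::ISO_8859_1)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  raw.dup.force_encoding(fallback)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
end

normalize_param("M\xC3\xBCnchen".b)  # bytes already UTF-8: relabeled only
normalize_param("M\xFCnchen".b)      # Latin-1 bytes: transcoded to UTF-8
```

Checking `valid_encoding?` up front means a cookie written in a different encoding is handled where it enters, instead of raising later inside a regex match.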
On Mon, Apr 26, 2010 at 10:30:16AM +0200, Paul Sponagl wrote:> > > - user shouldn''t really use non ascii characters in partials and > > templates - i18n is the solution and will help localize the > > application when it goes global > > -1 > > if you know that a rails app will run only within one country within > a controllable group (e.g. intranet apps) it does not make much > sense adding the overhead of seperate language files.I didn''t correctly state what I meant and thank you for helping me realize that :) What I did mean was that users shouldn''t assume non-ascii characters will always work correctly with Ruby 1.9, without specifying encoding comments or assuring specific, correct environment settings. So, let me rephrase myself: Users should not be able to use non-ascii characters in a us-ascii environment without providing an alternative encoding comment or overriding the environment settings. If neither of these are acceptable, i18n is a suggestion. This behavior would be consistent with the way Ruby loads source files. The reason is that doing otherwise can give obscure, hard to track encoding problems, looking like Rails bugs. By supplying a _default_ "us-ascii" encoding comment in generated template files, we help people oblivious to encoding details to do the right thing or do the necessary research (i18n, change encoding comments, localized versions of pages, etc). Encoding problems can be so frustrating, it is easy to perceive US developers as being ignorant. The truth is, it is unusual for them to even experience the problems or reproduce without effort, let alone research ways to test the issues effectively. This feature may slightly help with the latter. Suggestion ---------- I am wondering if Rails could actually assume us-ascii for all types of template files without a specified encoding, or emit a warning unless Ruby is running in full UTF-8 mode (-Ku) or full binary (-Kn). 
The fix would be adding encoding comments to all the files and may
mean a lot of work for existing projects. On the other hand, this is
more consistent with how Ruby 1.9 handles source files, so it won't be
a surprise to anyone.

This would prevent people from forgetting to put encoding comments in
partials, for example. And if this would really be troublesome, people
can always stick to Ruby 1.8 or run their servers with -Ku or even
-Kn.

Would anyone care to comment on this idea?

> >> Just to clarify how important this issue is: Rails 2.3 claims to be
> >> Ruby 1.9 compatible, but until this is fixed, even the most trivial of
> >> applications simply don't work on 1.9, especially if the application
> >> is in a language that often uses non-ASCII characters (pretty much
> >> anything other than English, in other words). This has prevented me
> >> from moving to Ruby 1.9.
> >
> > The m17n support in Ruby > 1.9 is a great concept. Unfortunately
> > balancing:
> >   - correctness
> >   - performance
> >   - robustness in a production environment
> > quickly turns encoding problems into philosophical debates. Without a
> > deep understanding of encoding internals it is too easy to "fix" things
> > by just converting to UTF-8, hiding the real issues.
>
> well - i "upgraded" our site running in germany to ruby1.9.1, unicorn and rails 2.3.6
> even with using utf-8 as a default i had to make various patches within rack to get it up and running.
>
> rack: utils
>
>     # Unescapes a URI escaped string. (Stolen from Camping).
>     def unescape(s)
>       result = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
>         [$1.delete('%')].pack('H*')
>       }
>       RUBY_VERSION >= "1.9" ? result.force_encoding(Encoding::UTF_8) : result
>     end
>     module_function :unescape
>
> found at lighthouse...
>
> the next one is horrible - i know, but it works for now:
>
>     def parse_query(qs, d = nil)
>       params = {}
>
>       (qs || '').split(d ? /[#{d}] */n : DEFAULT_SEP).each do |p|
>         k, v = p.split('=', 2).map { |x| unescape(x) }
>         begin
>           if v =~ /^("|')(.*)\1$/
>             v = $2.gsub('\\'+$1, $1)
>           end
>         rescue
>           v.force_encoding('ISO-8859-1')
>           v.encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')
>           if v =~ /^("|')(.*)\1$/
>             v = $2.gsub('\\'+$1, $1)
>           end
>         end
>
> (we use analytics at the site - analytics stores the last search query within a cookie. If a user browses google and finds the site with an umlaut query, this query will be stored within the cookie. parse_query will be used by rack to parse cookies too. guess what - it will go booom if you use utf-8 as a default and get an incoming cookie with a different encoding...)
>
> the next ugly thing :)
>
>     def normalize_params(params, name, v = nil)
>       if v and v =~ /^("|')(.*)\1$/
>         v = $2.gsub('\\'+$1, $1)
>       end
>       name =~ %r(\A[\[\]]*([^\[\]]+)\]*)
>       k = $1 || ''
>       after = $' || ''
>
>       return if k.empty?
>
>       if after == ""
>         params[k] = (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
>         # params[k] = v
>       elsif after == "[]"
>         params[k] ||= []
>         raise TypeError, "expected Array (got #{params[k].class.name}) for param `#{k}'" unless params[k].is_a?(Array)
>         params[k] << (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
>         # params[k] << v
>       elsif after =~ %r(^\[\]\[([^\[\]]+)\]$) || after =~ %r(^\[\](.+)$)
>
> all patches i found did not include the multipart solution ... this hack makes sure that multipart variables will be utf-8 forced too ...
>
> Yes / i am glad and thank you that you made this overdue summary!
>
> i hope others will have a better start into the ruby1.9 rails 2.3
> world than me. In fact there were times i really wondered why
> someone dares to state that rails is 1.9 compatible for a real
> world (not real US) app!
>
> Thanks a lot!

And thank you too for helping out! Especially for giving the summary
of rack issues with patches, which obviously saved me hours of
research.

It may be a while before Rails 2.3 becomes 1.9 compatible as the
result of detailed test cases and well thought out, politically
correct patches, but it is so encouraging to see Rails users not
giving up!

Thanks again!

-- 
Cezary Baginski
i am very busy writing new features for our project so i do not have the time/brainspace/brainpower now to think about clean solutions, but just one more - might help you too:

in my application_controller i added the very very very bad (i know - do not blame me - it's working :) charset test for routes with special chars as i wanted to use paths with umlauts too.

if a browser/search-bot defaults to ISO, requesting www.domain.com/über will obviously break things when using "über".force_encoding('utf-8') within rails...

    REGEXP_ISO = Regexp.new('[^\xc3][\xe4\xf6\xfc\xc4\xd6\xdc\xdf]', nil, 'n')
    REGEXP_MACROMAN = Regexp.new('[^\xc3][\x8a\x9a\x9f\x80\x85\x86\xa7]', nil, 'n')

    def check_params_encoding( key )
      unless params[key].blank?
        params[key].force_encoding('ASCII-8BIT')
        if params[key].match(REGEXP_ISO)
          params[key].force_encoding('ISO-8859-1')
          params[key].encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')
        elsif params[key].match(REGEXP_MACROMAN)
          params[key].force_encoding('macRoman')
          params[key].encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')
        end
      end
      params[key].force_encoding(Encoding::UTF_8)
    end

btw. i switched in one step from

    ruby 1.8.7 => ruby 1.9
    backgroundrb => delayed_job
    ferret => sphinx
    thin => unicorn
    2.3.2 => 2.3.6
    (memcached as frontend cache)

and i have to say (after blood, sweat and tears exceptions on the production servers leading to those quick hacks ;) IT ROCKS :) !

no more aaf ferret issues - fast searches - slim job workers - painless fast restarts - no more (un)fair balancing !

good luck for your projects with rails and all the best on your 1.9 travel !
Paul

On 26.04.2010 at 12:55, Czarek wrote:
> [...]

-- 
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To post to this group, send email to rubyonrails-core@googlegroups.com.
To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en.
yesterday i found another situation where i got hit by the 1.9 encoding problems. Believe it or not i've seen a case at our site where IE8 sends ISO-encoded uris after receiving the page incl. the link in UTF-8. I thought this is a safe one - but it is not!

Now i decided to add a very simple module force_recoding within lib (find it below / yes, could / should?! be a - rchardet like - native?! kernel method) and patch rack utils and rails - request.

btw. rchardet even in the 1.9 version of http://github.com/speedmax/rchardet did not work.

for now i would say - do not use rails with 1.9 outside the us unless you have fun debugging on production servers - and make sure that exception_notification works! - this last error prevented it from sending mails as erb got crazy while spitting the iso string into an utf-8 context ... i was informed by users ... smells like 1995 ...

and please rails core - write down the encoding problems within "Improved compatibility with Ruby 1.9" at http://weblog.rubyonrails.org/2009/11/30/ruby-on-rails-2-3-5-released and help newcomers get the right trail to rails! Now that i am working with rails for about 3 years - i can say i have at least a bit of experience - a newcomer will never use rails again when facing this kind of hard to track down errors. (i.m.O. only segfaults could be worse!)

i patched rack/utils.rb:

---------------8< ------------
 # -*- encoding: binary -*-

 require 'set'
 require 'tempfile'
+require 'force_recoding'

 module Rack
   # Rack::Utils contains a grab-bag of useful methods for writing web
   # applications adopted from all kinds of Ruby libraries.

   module Utils
+
+    include ForceRecoding
+    module_function :force_recoding
+
     # Performs URI escaping so that you can construct proper
     # query strings faster. Use this rather than the cgi.rb
     # version since it's faster. (Stolen from Camping).
@@ -21,9 +25,10 @@
     # Unescapes a URI escaped string. (Stolen from Camping).
     def unescape(s)
-      s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
+      result = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
         [$1.delete('%')].pack('H*')
       }
+      result = force_recoding( result )
     end
     module_function :unescape

@@ -32,16 +37,23 @@
     # Stolen from Mongrel, with some small modifications:
     # Parses a query string by breaking it up at the '&'
     # and ';' characters. You can also use this to parse
     # cookies by changing the characters used in the second
     # parameter (which defaults to '&;').
     def parse_query(qs, d = nil)
       params = {}

       (qs || '').split(d ? /[#{d}] */n : DEFAULT_SEP).each do |p|
         k, v = p.split('=', 2).map { |x| unescape(x) }
+        begin
+          if v =~ /^("|')(.*)\1$/
+            v = $2.gsub('\\'+$1, $1)
+          end
+        rescue
+          v = force_recoding( v )
           if v =~ /^("|')(.*)\1$/
             v = $2.gsub('\\'+$1, $1)
           end
+        end

         if cur = params[k]
           if cur.class == Array
             params[k] << v
@@ -79,12 +91,15 @@
       return if k.empty?

+
       if after == ""
-        params[k] = v
+        params[k] = force_recoding( v )
+        # params[k] = v
       elsif after == "[]"
         params[k] ||= []
         raise TypeError, "expected Array (got #{params[k].class.name}) for param `#{k}'" unless params[k].is_a?(Array)
-        params[k] << v
+        params[k] << force_recoding( v )
+        # params[k] << v
       elsif after =~ %r(^\[\]\[([^\[\]]+)\]$) || after =~ %r(^\[\](.+)$)
         child_key = $1
         params[k] ||= []

and now within rails action_controller request.rb:

---------------8< ------------
    include ForceRecoding

    # Returns the query string, accounting for server idiosyncrasies.
    def query_string
      @env['QUERY_STRING'].present? ? force_recoding(@env['QUERY_STRING']) : (force_recoding(@env['REQUEST_URI']).split('?', 2)[1] || '')
    end

    # Returns the request URI, accounting for server idiosyncrasies.
    # WEBrick includes the full URL. IIS leaves REQUEST_URI blank.
    def request_uri
      if uri = force_recoding(@env['REQUEST_URI'])
        # Remove domain, which webrick puts into the request_uri.
        (%r{^\w+\://[^/]+(/.*|$)$} =~ uri) ? $1 : uri
      else
        # Construct IIS missing REQUEST_URI from SCRIPT_NAME and PATH_INFO.
        uri = force_recoding(@env['PATH_INFO']).to_s

        if script_filename = @env['SCRIPT_NAME'].to_s.match(%r{[^/]+$})
          uri = uri.sub(/#{script_filename}\//, '')
        end

        env_qs = force_recoding(@env['QUERY_STRING']).to_s
        uri += "?#{env_qs}" unless env_qs.empty?

        if uri.blank?
          @env.delete('REQUEST_URI')
        else
          @env['REQUEST_URI'] = uri
        end
      end
    end

here is the module:

---------------8< ------------
module ForceRecoding

  REGEXP_ISO = Regexp.new('[^\xc3][\xe4\xf6\xfc\xc4\xd6\xdc\xdf]', nil, 'n')
  REGEXP_MACROMAN = Regexp.new('[^\xc3][\x8a\x9a\x9f\x80\x85\x86\xa7]', nil, 'n')

  def force_recoding( str )
    return str if RUBY_VERSION < "1.9" || str.nil? || !str.is_a?(String)
    unless str.blank?
      str.force_encoding('ASCII-8BIT')
      if str.match(REGEXP_ISO)
        str.force_encoding('ISO-8859-1')
        str.encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')
      elsif str.match(REGEXP_MACROMAN)
        str.force_encoding('macRoman')
        str.encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')
      end
    end
    str.force_encoding(Encoding::UTF_8)
    str
  end

end
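For comparison with the byte-pattern detection in force_recoding, a
different technique is to test whether the bytes are already valid
UTF-8 and only fall back to a legacy encoding when they are not. The
following is a minimal sketch of that alternative, not the patch from
this thread; the method name recode_to_utf8 and the ISO-8859-1 default
fallback are assumptions chosen for illustration:

```ruby
# Hypothetical helper, not part of Rack or Rails.
# Returns a UTF-8 copy of str: the bytes are kept as-is when they are
# already valid UTF-8, otherwise they are interpreted as `fallback`
# (ISO-8859-1 by default, an assumption) and transcoded.
def recode_to_utf8(str, fallback = Encoding::ISO_8859_1)
  return str unless str.is_a?(String)

  # Relabel a copy as UTF-8; force_encoding does not change the bytes.
  candidate = str.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Not valid UTF-8: treat the bytes as the fallback encoding and convert,
  # dropping anything that cannot be represented.
  str.dup.force_encoding(fallback).encode(
    Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => ''
  )
end

iso_bytes  = "\xFCber".dup.force_encoding(Encoding::BINARY)      # "über" in ISO-8859-1
utf8_bytes = "\xC3\xBCber".dup.force_encoding(Encoding::BINARY)  # "über" in UTF-8
puts recode_to_utf8(iso_bytes)  == "\u00FCber"
puts recode_to_utf8(utf8_bytes) == "\u00FCber"
```

Unlike the in-place force_recoding above, this sketch leaves its
argument untouched. It cannot tell legacy encodings apart (any byte
sequence is valid ISO-8859-1, so MacRoman input would be misread), and
a legacy string that happens to form valid UTF-8 is kept as UTF-8.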