Manfred Stienstra
2006-Sep-20 13:03 UTC
ActiveSupport::Multibyte for better Unicode support
Three months ago Julian Tarkhanov submitted a test implementation of his ActiveSupport::Multibyte string extension patch. Since then we've been steadily improving the extension based on the feedback we received.

The code has been completely refactored to be more transparent and easier to understand. There is now a single optional accelerated backend, and all multibyte-safe operations have a pure Ruby implementation. Test structure and coverage have also been greatly improved.

ActiveSupport::Multibyte is available as a plugin and can be converted to a patch using the included 'create_patch' rake task.

We would like to see ActiveSupport::Multibyte included in Rails so that developers can start depending on it for simpler and better Unicode support.

The ticket for the patch is at http://dev.rubyonrails.org/ticket/6242. More information and code can be found at https://fngtps.com/projects/multibyte_for_rails.

Manfred

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To post to this group, send email to rubyonrails-core@googlegroups.com
To unsubscribe from this group, send email to rubyonrails-core-unsubscribe@googlegroups.com
For more options, visit this group at http://groups.google.com/group/rubyonrails-core
-~----------~----~----~----~------~----~------~--~---
> We would like to see ActiveSupport::Multibyte included in Rails so
> that developers can start depending on it for simpler and better
> Unicode support.

I concur. Let this start an official request for comments. Any objections to getting this into core?
Michael Koziarski
2006-Sep-23 06:08 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 9/21/06, DHH <david.heinemeier@gmail.com> wrote:
> > We would like to see ActiveSupport::Multibyte included in Rails so
> > that developers can start depending on it for simpler and better
> > Unicode support.
>
> I concur. Let this start an official request for comments. Any
> objections to getting this into core?

I'm definitely keen to see this get added. However, I'm a bit concerned about the lack of discussion in this thread. It's a big piece of work, and I was hoping more people would have opinions on it.

--
Cheers
Koz
Manfred Stienstra
2006-Sep-23 07:08 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On Sep 23, 2006, at 8:08 AM, Michael Koziarski wrote:
> I'm definitely keen to see this get added. However I'm a bit
> concerned about the lack of discussion in this thread. It's a big
> piece of work, and I was hoping more people would have opinions on it

I think that's the problem: because the codebase is pretty esoteric, not many people want to dive in and give their opinion. If anyone is interested, I could explain at a global level what it does and what decisions were made during coding, without getting into all the details concerning encoding.

Manfred
Peter Michaux
2006-Sep-23 07:19 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 9/23/06, Manfred Stienstra <manfred@gmail.com> wrote:
> On Sep 23, 2006, at 8:08 AM, Michael Koziarski wrote:
>
> > I'm definitely keen to see this get added. However I'm a bit
> > concerned about the lack of discussion in this thread. It's a big
> > piece of work, and I was hoping more people would have opinions on it
>
> I think that's the problem, because the codebase is pretty esotheric
> not much people want to dive in and give their opinion. I could
> explain on a global level, without gettting into all the details
> concerning encoding, what it does and what decisions were made during
> coding if anyone is interested.

I'm interested in a general overview of what problem it fixes and why it is needed. I don't know much about the whole Unicode problem with Ruby that people keep bringing up, only for others to say it isn't really a problem.

Peter
Mathieu Jobin
2006-Sep-23 08:15 UTC
Re: ActiveSupport::Multibyte for better Unicode support
The ticket description already seems to be a very good general overview.

If my opinion counts and this package has been well tested, I'd say "Add please". Although if it only patches Ruby, not Rails, it could be a separate gem or a patch on Ruby core/stdlib.

Mathieu
Manfred Stienstra
2006-Sep-23 08:30 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On Sep 23, 2006, at 10:15 AM, Mathieu Jobin wrote:
> The ticket description already seems to be a very good general
> overview.
> if my opinion count and this package has been well tested, I'd say
> "Add please".
> although if it only patches ruby, not rails, it could be a separate
> gems or a patch on ruby core/stdlib

Matz claims that Ruby currently has enough tools to deal with encoding. The problem is that you have to be an expert to do it right. The earliest Ruby is going to deal with encoding is in Ruby 2.0, and that's not going to come out really soon. So this leaves the encoding problem with the application programmers. Even though I have to admit that I would rather see a good solution in Ruby core or in a stdlib, it's not going to happen. ActiveSupport::Multibyte is an attempt to make dealing with encoding simpler for the Rails (core) programmer, right now. It could also work as a deprecation mechanism when/if encoding support in Ruby comes out.

If ActiveSupport::Multibyte were released as a gem or standalone library, Rails code couldn't depend on it and we'd have to litter the code with if statements.

Manfred
Mathieu Jobin
2006-Sep-23 08:38 UTC
Re: ActiveSupport::Multibyte for better Unicode support
Makes total sense. Thanks.

Mathieu
Mislav Marohnić
2006-Sep-23 14:10 UTC
Re: ActiveSupport::Multibyte for better Unicode support
Peter,

The problem is correctly supporting multibyte strings. Unicode, the most complete character set, has several encodings (UTF-8 being the most popular one), each of them having some (or all) characters expressed with two or more bytes (unlike ASCII, for instance).

In UTF-8, "abc" is a three-character string encoded in 3 bytes, but "čžš" (3 characters from the Croatian alphabet) is encoded in 6 bytes (2 bytes each). Multibyte-unaware programming languages (like Ruby and PHP < 6) assume 1 character = 1 byte.

In Ruby, try string.reverse or string.length on strings containing special characters to see some unexpected results. Reverse will corrupt the string, while length will report bytes, not characters. These are trivial examples; the problem goes much deeper.

Rails needs this.

--
Mislav
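To make the byte-versus-character distinction above concrete, here is a minimal sketch. It rests on one assumption worth stating loudly: current Ruby versions are encoding-aware, so the byte-oriented behaviour described in the message is simulated here by working on the raw bytes. The helper names `bytewise_reverse` and `codepoint_reverse` are hypothetical and do not come from ActiveSupport::Multibyte.

```ruby
# Three Croatian letters, written with escapes so the source encoding
# does not matter: "čžš"
croatian = "\u010D\u017E\u0161"

byte_count = croatian.bytesize              # what a byte-oriented length reports
char_count = croatian.unpack("U*").length   # number of Unicode code points

# Hypothetical helper: reverse byte by byte, the way a multibyte-unaware
# String#reverse would. The result is relabelled as UTF-8 so it can be checked.
def bytewise_reverse(str)
  str.bytes.reverse.pack("C*").force_encoding("UTF-8")
end

# Hypothetical helper: reverse code point by code point instead.
def codepoint_reverse(str)
  str.unpack("U*").reverse.pack("U*")
end

broken = bytewise_reverse(croatian)
intact = codepoint_reverse(croatian)

puts byte_count                       # 6
puts char_count                       # 3
puts broken.valid_encoding?           # false: continuation bytes now come first
puts intact == "\u0161\u017E\u010D"   # true: "šžč"
```

The byte-wise reverse fails because each two-byte sequence ends up with its continuation byte ahead of its lead byte, which is not valid UTF-8.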
Charles O Nutter
2006-Sep-23 14:43 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 9/20/06, Manfred Stienstra <manfred@gmail.com> wrote:
> Three months ago Julian Tarkhanov submitted a test implementation of
> his ActiveSupport::Multibyte string extension patch. Since then we've
> been steadily improving the extension based on the feedback we received.

It appears this doesn't have any native/C code, but can you confirm that in case I'm not looking hard enough? Obviously we JRubyists wouldn't want anything in Rails to start requiring code we can't run.

--
Contribute to RubySpec! @ www.headius.com/rubyspec
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn
Charles O Nutter wrote:
> On 9/20/06, Manfred Stienstra <manfred@gmail.com> wrote:
>> Three months ago Julian Tarkhanov submitted a test implementation of
>> his ActiveSupport::Multibyte string extension patch. Since then we've
>> been steadily improving the extension based on the feedback we received.
>
> It appears this doesn't have any native/C code, but can you confirm
> that in case I'm not looking hard enough? Obviously we JRubyists
> wouldn't want anything in Rails to start requiring code we can't run.

How does JRuby handle strings? If they are mapped to java.lang.String, then JRuby already has more than adequate Unicode support.

It seems to me that .chars should return the same object if the underlying VM supports Unicode. I would guess that today that would include JRuby, and in the future it would include Ruby 2.0.

Some day in the future, when Ruby 1.x is a distant memory, .chars should be deprecated and ultimately removed.

- Sam Ruby
Michael Koziarski
2006-Sep-23 22:45 UTC
Re: ActiveSupport::Multibyte for better Unicode support
> Some day in the future, when Ruby 1.x is a distant memory, .chars should
> be deprecated, and ultimately removed.

That's definitely our intention. If JRuby is using java.lang.String, then a simple plugin which does the following would be sufficient:

    class String
      def chars
        self
      end
    end

We'll update ActiveSupport to contain that (with appropriate deprecation) when Ruby 2.x comes to the party.

--
Cheers
Koz
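The idea discussed above, a `.chars` accessor that returns a multibyte-aware proxy on a byte-oriented runtime and the string itself on a Unicode-aware one, can be sketched roughly as follows. This is an illustration under assumptions, not the ActiveSupport::Multibyte source: `SketchCharsProxy`, `sketch_chars`, and the `native_unicode` flag are hypothetical names, and the sketch deliberately avoids reopening String because current Ruby versions define their own `String#chars`.

```ruby
# Hypothetical proxy: wraps a UTF-8 string and operates on code points
# rather than bytes. Only two operations are shown.
class SketchCharsProxy
  def initialize(string)
    @codepoints = string.unpack("U*")
  end

  # Length in code points, not bytes.
  def length
    @codepoints.length
  end

  # Returns a new proxy with the code points in reverse order.
  def reverse
    self.class.new(@codepoints.reverse.pack("U*"))
  end

  # Converts back to a plain UTF-8 string.
  def to_s
    @codepoints.pack("U*")
  end
end

# Hypothetical accessor: on a runtime with native Unicode strings it hands
# back the string untouched; otherwise it wraps the string in the proxy.
def sketch_chars(string, native_unicode)
  native_unicode ? string : SketchCharsProxy.new(string)
end

name = "\u010D\u017E\u0161"                      # "čžš"
puts sketch_chars(name, false).length            # 3
puts sketch_chars(name, false).reverse.to_s      # šžč
puts sketch_chars(name, true).equal?(name)       # true
```

Because calling code goes through the accessor either way, switching from the proxy to the native string later would not require touching the callers, which is the deprecation path described in the message.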
Pete Yandell
2006-Sep-24 01:21 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 21/09/2006, at 12:15 AM, DHH wrote:
>> We would like to see ActiveSupport::Multibyte included in Rails so
>> that developers can start depending on it for simpler and better
>> Unicode support.
>
> I concur. Let this start an official request for comments. Any
> objections to getting this into core?

I'm definitely in favour of seeing something like this in core. Better Unicode handling is needed yesterday! The chars proxy is a very nice way of handling this.

A question: How does this compare to the unicode_hacks plugin? (See http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/) They seem very similar in both intent and interface.

Some comments: Even with this plugin, supporting Unicode in a Rails app is too complicated and fiddly. For those who haven't tried it, here are the steps:

- Make sure your database character set is utf8
- Make sure all your tables have a character set of utf8
- Make sure your database.yml has 'encoding: utf8' set for each database
- Put $KCODE='u' in your environment.rb
- Add an after_filter to application.rb to set the Content-Type header correctly
- Add 'normalize_unicode_params :form => :kc' to your application.rb

Missing one of these steps can produce strange results and corrupted data.

If Unicode support is being included in core, then this needs to be rationalised. Ideally a single setting in environment.rb should take care of all of this. I also think it should be enabled by default. (Who doesn't want to support Unicode nowadays?)

Rumour also has it that ActiveRecord, when recreating timed-out database connections, doesn't honour the 'encoding: utf8' setting. I've never run into this personally, so I assume it was fixed at some point?

Cheers,

Pete Yandell
Thijs van der Vossen
2006-Sep-24 11:07 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 23 Sep 2006, at 16:43, Charles O Nutter wrote:
> On 9/20/06, Manfred Stienstra <manfred@gmail.com> wrote:
>> Three months ago Julian Tarkhanov submitted a test implementation of
>> his ActiveSupport::Multibyte string extension patch. Since then we've
>> been steadily improving the extension based on the feedback we
>> received.
>
> It appears this doesn't have any native/C code, but can you confirm
> that in case I'm not looking hard enough?

Confirmed. All operations are implemented in pure Ruby.

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com
Phone: +31 (0)6 24204845
Skype: tvandervossen
MSN Messenger: thijs@fngtps.com
iChat/AOL: t.vandervossen@mac.com
Jabber IM: thijs@jabber.org
Joshua Sierles
2006-Sep-24 12:20 UTC
Re: ActiveSupport::Multibyte for better Unicode support
> - Make sure your database character set is utf8
> - Make sure all your tables have a character set of utf8
> - Make sure your database.yml has 'encoding: utf8' set for each database

None of these steps are officially required unless you use UTF-8-specific features of the database (collation). The last setting seems to set the connection encoding, which shouldn't be required unless there is non-UTF-8 data stored in the database.

> - Put $KCODE='u' in your environment.rb

This is only required if you use Unicode strings in your Ruby code.

> - Add an after_filter to application.rb to set the Content-Type header correctly

Rails now defaults to a UTF-8 Content-Type.

Joshua Sierles
Thijs van der Vossen
2006-Sep-24 18:35 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 24 Sep 2006, at 03:21, Pete Yandell wrote:
> How does this compare to the unicode_hacks plugin? (See
> http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/) They
> seem very similar in both intent and interface.

ActiveSupport::Multibyte is a component of the Multibyte for Rails project, which is basically the next version of the unicode_hacks plugin.

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com
Phone: +31 (0)6 24204845
Skype: tvandervossen
MSN Messenger: thijs@fngtps.com
iChat/AOL: t.vandervossen@mac.com
Jabber IM: thijs@jabber.org
Pete Yandell
2006-Sep-25 01:58 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 24/09/2006, at 10:20 PM, Joshua Sierles wrote:
>> - Make sure your database character set is utf8
>> - Make sure all your tables have a character set of utf8
>> - Make sure your database.yml has 'encoding: utf8' set for each
>> database
>
> None of these steps are required officially unless you use utf-8
> specific features of the database (collation). The last setting seems
> to set the connection encoding, which shouldn't be required unless
> there is non-utf8 data stored in the database.

Not true! Collation and character set are separate things. There are a couple of obvious reasons you want your database character set to be UTF8 if you're storing UTF8 strings:

1. When you access the database through the mysql (or pgsql, or other) command line, or through tools such as CocoaMySQL, you want strings to display properly.

2. MySQL never treats strings as binary; they always have a character set, which is latin1 (CP1252) by default. Putting UTF8 data into fields marked as latin1 seems like asking for trouble. (There are some byte values that are invalid in CP1252, so technically strings containing those bytes are illegal. It's only through MySQL's laziness in not checking the strings when the connection and table character sets match up that you can get away with this at all.)

There are even worse potential pitfalls here too. On one of our projects, we did everything except set the connection encoding. What happened was that a UTF8 string in Rails would be regarded as CP1252 by MySQL, but MySQL knew that the tables needed UTF8, so it did a CP1252 to UTF8 conversion on the (already UTF8) string before writing it. As you can imagine, we ended up with all sorts of crap in the database, and the occasional string got completely munged as invalid CP1252 bytes were replaced with question marks.

These three things should at least be reduced to a single setting to avoid mistakes. I can't imagine a situation in which you would want to do one of them without the others.

>> - Put $KCODE='u' in your environment.rb
>
> This is only required if you use unicode strings in your Ruby code.

If your app handles UTF8, then you're going to want to write tests involving UTF8 strings, so you're going to need this turned on. You do write UTF8 tests for your apps, right? :)

> - Add an after_filter to application.rb to set the Content-Type
> header correctly
>
> Rails now defaults to utf-8 Content-Type.

Good to know. I'll take this as an endorsement of the idea that UTF8 should be the default for Rails apps. :)

Cheers,

Pete Yandell
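The double-encoding pitfall described in this message (UTF-8 bytes treated as CP1252 and then converted to UTF-8 a second time) can be reproduced without a database. The following is a minimal sketch that uses Ruby's own transcoder as a stand-in for the server-side conversion; it is an assumption that this mirrors what MySQL did in that project, and the variable names are illustrative only.

```ruby
# "é" as a single code point; its UTF-8 encoding is the two bytes C3 A9.
original = "\u00E9"

# Step 1: the receiving side wrongly labels those UTF-8 bytes as CP1252.
mislabelled = original.dup.force_encoding("Windows-1252")

# Step 2: it then "converts" the supposed CP1252 text to UTF-8 for storage.
double_encoded = mislabelled.encode("UTF-8")

puts double_encoded                          # Ã©
puts double_encoded.bytesize                 # 4
puts double_encoded == "\u00C3\u00A9"        # true

# Some UTF-8 sequences contain bytes that the converter may not be able to
# map from CP1252 at all. The options below ask the transcoder to substitute
# "?" for any such byte instead of raising an error.
lossy = "\u010D".dup.force_encoding("Windows-1252")
lossy = lossy.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
puts lossy.valid_encoding?                   # true: valid UTF-8, but no longer "č"
```

One character has become two, and the stored text is still valid UTF-8, which is why this kind of corruption tends to go unnoticed until the data is displayed.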
David Goodlad
2006-Sep-25 03:06 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 9/24/06, Pete Yandell <pete.yandell@gmail.com> wrote:
> Good to know. I'll take this as an endorsement of the idea the UTF8
> should be the default for Rails apps. :)

I have to put in my two cents here. I can't see any reason why one _wouldn't_ want to use UTF-8 over plain-ol' ASCII. It's a totally different ball game than localization; I just want my users to be able to input data using their own native characters. What app doesn't have a "full name" field for a user? Shouldn't your users be able to input their name properly? :)

Besides implementation issues, I can't see any real downside to supporting UTF-8 out of the box in Rails. It would sure avoid a lot of potential issues...

Dave

--
Dave Goodlad
dgoodlad@gmail.com or dave@goodlad.ca
http://david.goodlad.ca/
Julian ''Julik'' Tarkhanov
2006-Sep-25 14:48 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 24-sep-2006, at 3:21, Pete Yandell wrote:
> How does this compare to the unicode_hacks plugin? (See
> http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/) They
> seem very similar in both intent and interface.

It's a sanctioned evolution thereof. Manfred and Thijs took over the work while I am plowing through my internship (which, BTW, has nothing to do with Rails and web development). We split the repositories so that they can make extensive code changes without hurting everyone sitting on unicode_hacks.

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
Julian ''Julik'' Tarkhanov
2006-Sep-25 14:49 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 25-sep-2006, at 5:06, David Goodlad wrote:
> I can't see any real downside to
> supporting UTF-8 out of the box in Rails.

Tell it to the Japanese and the Chinese railers. I wonder how long you will stand before you get your ass served :-)

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
David Goodlad
2006-Sep-25 16:06 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 9/25/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> On 25-sep-2006, at 5:06, David Goodlad wrote:
>
> > I can't see any real downside to
> > supporting UTF-8 out of the box in Rails.
>
> Tell it to the Japanese and the Chinese railers. I wonder how long
> you will stand before you get your ass served :-)

You mean they would get mad if Rails _did_ support UTF-8 out of the box?

Dave

--
Dave Goodlad
dgoodlad@gmail.com or dave@goodlad.ca
http://david.goodlad.ca/
Michael Koziarski
2006-Sep-25 22:17 UTC
Re: ActiveSupport::Multibyte for better Unicode support
> You mean they would get mad if Rails _did_ support UTF-8 out of the box?

Yeah, UTF-8 and Unicode aren't terribly popular in Japan. For more information than you ever thought you'd want, you can read up on Han unification. UTF-8 is also much less space-efficient than their 'legacy' encodings.

--
Cheers
Koz
Pete Yandell
2006-Sep-25 22:30 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26/09/2006, at 12:49 AM, Julian 'Julik' Tarkhanov wrote:
> On 25-sep-2006, at 5:06, David Goodlad wrote:
>
>> I can't see any real downside to
>> supporting UTF-8 out of the box in Rails.
>
> Tell it to the Japanese and the Chinese railers. I wonder how long
> you will stand before you get your ass served :-)

Why? It's not like Rails supports Japanese or Chinese encodings out of the box now. How is going from supporting just ASCII to supporting UTF-8 taking anything away from Japanese or Chinese railers?

Like David said, what exactly is the downside to default UTF-8 support? Who does it hurt, how, and why?

Pete Yandell
Michael Koziarski wrote:
>> You mean they would get mad if Rails _did_ support UTF-8 out of the box?
>
> Yeah, UTF-8 and unicode aren't terribly popular in japan. For more
> information than you ever thought you'd want, you can read up on the
> Han unification. It's also much less efficient (space wise) than
> their 'legacy' encodings.

Java and C# seem to do OK in Japan.

I would also imagine that ASCII wouldn't be very popular in Japan. :-)

- Sam Ruby
Michael Koziarski
2006-Sep-25 23:24 UTC
Re: ActiveSupport::Multibyte for better Unicode support
> Java and C# seem to do OK in Japan.
>
> I would also imagine that ASCII wouldn't be very popular in Japan. :-)

I should clarify: don't take my previous statement as disagreeing with "utf-8 everywhere". I'm for it, not against it. But it's definitely not as simple an issue as it appears at first glance ;)

--
Cheers

Koz
Manfred Stienstra
2006-Sep-26 06:41 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On Sep 26, 2006, at 12:17 AM, Michael Koziarski wrote:
>
>> You mean they would get mad if Rails _did_ support UTF-8 out of
>> the box?
>
> Yeah, UTF-8 and unicode aren't terribly popular in Japan. For more
> information than you ever thought you'd want, you can read up on the
> Han unification. It's also much less efficient (space wise) than
> their 'legacy' encodings.

ActiveSupport::Multibyte doesn't favor any encoding. It currently implements UTF-8 operations because that's what we, and a lot of other people on the web, use daily. We believe that you shouldn't implement anything you're not going to use yourself.

This is also explained on our Trac page, in the FAQ:

https://fngtps.com/projects/multibyte_for_rails/wiki/FAQ

Manfred
Thijs van der Vossen
2006-Sep-26 07:12 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26 Sep 2006, at 01:15 , Sam Ruby wrote:
> Michael Koziarski wrote:
>>> You mean they would get mad if Rails _did_ support UTF-8 out of
>>> the box?
>>
>> Yeah, UTF-8 and unicode aren't terribly popular in Japan. For more
>> information than you ever thought you'd want, you can read up on
>> the Han unification. It's also much less efficient (space wise)
>> than their 'legacy' encodings.
>
> Java and C# seem to do OK in Japan.

And for good reason. I have yet to see an example of something that you can do in Shift-JIS and EUC that you can't do with Unicode 5 encoded as UTF-8. I'm not saying there are no issues some people feel strongly about, but there are certainly no compelling technical or practical reasons why you can't use Unicode in Japan.

Even so, Ruby supports Shift-JIS and EUC and will continue to. Because Rails gets so much out of Ruby, it would be somewhat rude if the next Rails release were to make it impossible to use these encodings. That's _exactly_ why ActiveSupport::Multibyte is designed to support multiple encodings.

The only reason Shift-JIS and EUC are currently not implemented in ActiveSupport::Multibyte is that we don't feel comfortable building stuff we don't use. So, if you need Shift-JIS or EUC, please add it to ActiveSupport::Multibyte and send us a patch.

For more information see the Multibyte for Rails FAQ:

https://fngtps.com/projects/multibyte_for_rails/wiki/FAQ

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com

Phone: +31 (0)6 24204845
Skype: tvandervossen
MSN Messenger: thijs@fngtps.com
iChat/AOL: t.vandervossen@mac.com
Jabber IM: thijs@jabber.org
Thijs van der Vossen
2006-Sep-26 07:44 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26 Sep 2006, at 24:30 , Pete Yandell wrote:
> On 26/09/2006, at 12:49 AM, Julian 'Julik' Tarkhanov wrote:
>> On 25-sep-2006, at 5:06, David Goodlad wrote:
>>> I can't see any real downside to supporting UTF-8 out of the box
>>> in Rails.
>>
>> Tell it to the Japanese and the Chinese railers. I wonder how long
>> you will stand before you get your ass served :-)
>
> Why? It's not like Rails supports Japanese or Chinese encodings out
> of the box now. How is going from supporting just ASCII to
> supporting UTF-8 taking anything away from Japanese or Chinese
> railers?
>
> Like David said, what exactly is the downside to default UTF-8
> support? Who does it hurt, how, and why?

There's no downside to default UTF-8 support, but it would be nice if switching from the default to Shift-JIS or EUC were as easy as changing $KCODE = 'utf-8' to $KCODE = 'sjis'.

If you want this, please add Shift-JIS and/or EUC support to ActiveSupport::Multibyte and send us a patch.

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com
Michael Koziarski
2006-Sep-26 07:49 UTC
Re: ActiveSupport::Multibyte for better Unicode support
> So, if you need Shift-JIS or EUC, please add it to
> ActiveSupport::Multibyte and send us a patch.

Other encodings can be supported with plugins initially. I'm personally happy with utf-8 only as a position for 1.2.

--
Cheers

Koz
Michael Koziarski
2006-Sep-26 08:06 UTC
Re: ActiveSupport::Multibyte for better Unicode support
> > - Make sure your database character set is utf8
> > - Make sure all your tables have a character set of utf8
> > - Make sure your database.yml has 'encoding: utf8' set for each database
>
> None of these steps are required officially unless you use utf-8
> specific features of the database (collation). The last setting seems
> to set the connection encoding, which shouldn't be required unless
> there is non-utf8 data stored in the database.
>
> > - Put $KCODE='u' in your environment.rb
>
> This is only required if you use unicode strings in your Ruby code.
>
> > - Add an after_filter to application.rb to set the Content-Type
> > header correctly
>
> Rails now defaults to utf-8 Content-Type.

So, if we merged in ActiveSupport::Multibyte, and updated helpers like truncate to use the chars proxy, what other changes would be required to make this stuff simple? Normalisation of input parameters? Anything else?

It would be nice if we could make it really easy to have this stuff 'just work' without much in the way of additional user intervention.

--
Cheers

Koz
Manfred Stienstra
2006-Sep-26 08:27 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26-sep-2006, at 10:06, Michael Koziarski wrote:
> So, if we merged in ActiveSupport::Multibyte, and updated helpers
> like truncate to use the chars proxy, what other changes would be
> required to make this stuff simple? Normalisation of input
> parameters? Anything else?

Well, normalization of input parameters depends on the situation. If you want to compare strings you probably want compatibility normalization (like NFKC), but compatibility normalization forms also lose data. For instance, the ligature ffi:

  "ﬃ".chars.normalize(:kc) #=> "ffi"

Or the 'vulgar fraction one quarter':

  "¼".chars.normalize(:kc) #=> "1/4"

When you're comparing strings, you might want "¼" to be equal to "1/4". When you want your users to use nice glyphs, you can't just discard this data.

But _if_ you normalize, you have to make sure you _always_ normalize. For instance, when you save a password to the database and normalize it, you have to make sure that you always normalize passwords from forms, otherwise the password might not match when filled out by the user. Using NFKC might introduce false positives because "¼".chars.normalize == "1/4".chars.normalize, which isn't a very large problem if the rest of the password is strong enough.

Currently normalization is implemented in a separate plugin called 'utf8_plugin' [1], and can be turned on by the class method `normalize_unicode_params'. You can find more information in our Unicode Primer [2].

Manfred

[1] https://fngtps.com/svn/multibyte_for_rails/utf8_plugin
[2] https://fngtps.com/projects/multibyte_for_rails/wiki/UnicodePrimer
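To make the trade-off in this message concrete, here is a minimal sketch. It assumes a current Ruby (2.2 or later) and uses the core String#unicode_normalize method instead of the plugin's chars.normalize, so it illustrates the Unicode behaviour and not the ActiveSupport::Multibyte API itself. The constant names and the passwords_match? helper are made up for illustration. One detail worth knowing: with current Unicode data, the slash that NFKC produces for "¼" is U+2044 FRACTION SLASH, not the ASCII "/".

```ruby
# Sketch only: relies on Ruby's built-in String#unicode_normalize (Ruby 2.2+),
# not on the chars.normalize API from the plugin discussed in this thread.

LIGATURE = "\uFB03" # LATIN SMALL LIGATURE FFI
QUARTER  = "\u00BC" # VULGAR FRACTION ONE QUARTER

# Compatibility normalization (NFKC) replaces the ligature with three letters
# and expands the fraction, so the original glyphs are lost.
LIGATURE_KC = LIGATURE.unicode_normalize(:nfkc)
QUARTER_KC  = QUARTER.unicode_normalize(:nfkc)

# Canonical normalization (NFC) leaves both characters untouched.
LIGATURE_C = LIGATURE.unicode_normalize(:nfc)
QUARTER_C  = QUARTER.unicode_normalize(:nfc)

# Hypothetical helper: compare two passwords after normalizing both sides
# with the same form, so the stored and submitted values always agree.
def passwords_match?(stored, submitted, form = :nfc)
  stored.unicode_normalize(form) == submitted.unicode_normalize(form)
end

COMPOSED   = "caf\u00E9"  # e with acute as a single code point
DECOMPOSED = "cafe\u0301" # e followed by COMBINING ACUTE ACCENT

puts passwords_match?(COMPOSED, DECOMPOSED) # the two spellings compare equal
```

The helper normalizes both arguments on every call, which is one way to honour the "always normalize" rule without trusting that stored data was normalized earlier.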
Mathieu Jobin
2006-Sep-26 08:42 UTC
Re: ActiveSupport::Multibyte for better Unicode support
OK, so ActiveSupport::Multibyte could work with SJIS and EUC-JP, but it seems that needs some extra work from someone who understands those encodings.

Well, I think if ActiveSupport::Multibyte gets integrated into Rails with decent docs (docs that include writing plugins for other encodings), I'm sure you have a lot more chance to see a Japanese guru sending you a patch.

If it does not get integrated, they won't know about it, or won't care cuz it ain't mainstream.

And I am using utf-8 a good 80% of the time anyway, so I'm totally with the motion.

On 9/26/06, Thijs van der Vossen <t.vandervossen@gmail.com> wrote:
> On 26 Sep 2006, at 24:30 , Pete Yandell wrote:
> > On 26/09/2006, at 12:49 AM, Julian 'Julik' Tarkhanov wrote:
> >> On 25-sep-2006, at 5:06, David Goodlad wrote:
> >>> I can't see any real downside to supporting UTF-8 out of the box
> >>> in Rails.
> >>
> >> Tell it to the Japanese and the Chinese railers. I wonder how long
> >> you will stand before you get your ass served :-)
> >
> > Why? It's not like Rails supports Japanese or Chinese encodings out
> > of the box now. How is going from supporting just ASCII to
> > supporting UTF-8 taking anything away from Japanese or Chinese
> > railers?
> >
> > Like David said, what exactly is the downside to default UTF-8
> > support? Who does it hurt, how, and why?
>
> There's no downside to default UTF-8 support, but it would be nice if
> switching from the default to Shift-JIS or EUC is going to be as easy
> as changing $KCODE = 'utf-8' to $KCODE = 'sjis'.
>
> If you want this, please add Shift-JIS and/or EUC support in
> ActiveSupport::Multibyte and send us a patch.
>
> Kind regards,
> Thijs

Mathieu
Julian ''Julik'' Tarkhanov
2006-Sep-26 08:46 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26-sep-2006, at 9:49, Michael Koziarski wrote:
> Other encodings can be support with plugins initially, I'm personally
> happy with utf-8 only as a position for 1.2.

+3

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
Julian ''Julik'' Tarkhanov
2006-Sep-26 08:51 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26-sep-2006, at 10:06, Michael Koziarski wrote:
>
> It would be nice if we could make it really easy to have this stuff
> 'just work' without much in the way of additional user intervention.

Normalization on input and before saving to the database, but this might scare some people off if used wrong. What Rails might do is adopt the Character Model for the Web and just stick to C normalizations everywhere. However, I think this should stay optional, because it might raise exceptions and leave loose ends in situations where people send intrinsic bytestrings as input parameters.

What I do is define input normalization as a filter for ApplicationController, as the step in the chain responsible for input sanitization. Implicit normalization at runtime is not the way, because it transiently changes the offsets of strings as soon as you slice/truncate/concatenate.

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
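The input filter described here could look roughly like the sketch below. It is plain Ruby with no Rails dependency, written against a current Ruby (2.2 or later) and its core String#unicode_normalize. The normalize_input helper is a hypothetical name, not something from Rails or from utf8_plugin, and the choice to pass invalid or binary strings through untouched is an assumption made to address the bytestring concern raised in the message.

```ruby
# Sketch only: a plain-Ruby stand-in for an input-normalization filter.
# normalize_input is a hypothetical helper, not part of Rails or the plugin.
def normalize_input(value, form = :nfc)
  case value
  when String
    # Only touch strings that are valid UTF-8. Raw byte strings are returned
    # as they came in, so binary parameters do not raise.
    if value.encoding == Encoding::UTF_8 && value.valid_encoding?
      value.unicode_normalize(form)
    else
      value
    end
  when Hash
    # Walk nested parameter hashes, keeping the keys as they are.
    value.each_with_object({}) { |(k, v), out| out[k] = normalize_input(v, form) }
  when Array
    value.map { |v| normalize_input(v, form) }
  else
    value
  end
end

params = { "user" => { "name" => "Zoe\u0308" }, "blob" => "\xFF\xFE".b }
clean  = normalize_input(params)
puts clean["user"]["name"].length # the decomposed name is now 3 characters
```

Doing this once at the edge of the application, as the message suggests, keeps string offsets stable afterwards because nothing is renormalized later in the request.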
Julian ''Julik'' Tarkhanov
2006-Sep-26 08:52 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26-sep-2006, at 10:06, Michael Koziarski wrote:
>
> So, if we merged in ActiveSupport::Multibyte, and updated helpers
> like truncate to use the chars proxy, what other changes would be
> required to make this stuff simple? Normalisation of input
> parameters? Anything else?

KCODE, all response charsets out of the box UTF, maybe processing the params with iconv according to the request-charset.

But first and foremost - clear documentation.

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
Michael Koziarski
2006-Sep-26 09:16 UTC
Re: ActiveSupport::Multibyte for better Unicode support
> KCODE, all response charsets out of the box UTF, maybe processing the
> params with iconv according to the request-charset.

Is the request charset sent by all browsers for all requests? How risky is automatically translating with iconv (assuming it's available)? Incidentally, this is what I meant by normalization, that'll teach me to use a reserved word ;).

> But first and foremost - clear documentation.

What do you feel is currently missing from the ActiveSupport::Multibyte patch?

--
Cheers

Koz
Thijs van der Vossen
2006-Sep-26 09:17 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26 Sep 2006, at 10:52 , Julian 'Julik' Tarkhanov wrote:
> On 26-sep-2006, at 10:06, Michael Koziarski wrote:
>> So, if we merged in ActiveSupport::Multibyte, and updated helpers
>> like truncate to use the chars proxy, what other changes would be
>> required to make this stuff simple? Normalisation of input
>> parameters? Anything else?
>
> KCODE,

I agree. It's the Ruby way to set your encoding using $KCODE, so Rails 1.2 should have $KCODE='utf-8' in environment.rb.

> all response charsets out of the box UTF,

This is already in trunk since changeset 5129.

> maybe processing the params with iconv according to the request-
> charset.

This is only needed for very old and badly broken browsers. I don't think Rails should do this by default.

Kind regards,
Thijs
Julian ''Julik'' Tarkhanov
2006-Sep-26 09:18 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26-sep-2006, at 0:30, Pete Yandell wrote:
> Like David said, what exactly is the downside to default UTF-8
> support? Who does it hurt, how, and why?

It doesn't hurt _us_. I'm 200% for it anyways, just wanted to bring the point before anyone sneaks up on us about it.

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
Julian ''Julik'' Tarkhanov
2006-Sep-26 09:29 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26-sep-2006, at 11:16, Michael Koziarski wrote:
>> KCODE, all response charsets out of the box UTF, maybe processing the
>> params with iconv according to the request-charset.
>
> Is the request charset sent by all browsers for all requests? How
> risky is automatically translating with iconv (assuming it's
> available)? Incidentally, this is what I meant by normalization,
> that'll teach me to use a reserved word ;).

I see almost no risk. It has to do with a browser (or a REST client, for that matter) using a wrong charset when doing the request. The server receiving the request can then decode the request into its internal encoding. This is how (among others) the Trackback system works in MovableType. But just as Thijs said, we might as well omit that. It has nothing to do with normalisation.

>> But first and foremost - clear documentation.
>
> What do you feel is currently missing from the
> ActiveSupport::Multibyte patch?

As one of the authors I feel pretty secure here. Just wanted to make sure the big README we have put there gets a visible spot in the AS docs.

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
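Decoding a request into the server's internal encoding, as described above, can be sketched as follows. This is an illustration under assumptions: it uses String#force_encoding and String#encode from current Ruby (1.9 or later) in place of the iconv library the thread talks about, and to_internal_utf8 is a hypothetical helper name. Returning nil on failure is a choice made for the sketch, not something the thread prescribes.

```ruby
# Sketch only: String#encode stands in for the iconv call discussed above.
# to_internal_utf8 is a hypothetical helper name.
def to_internal_utf8(raw_bytes, declared_charset)
  # Label a copy of the bytes with the charset the client declared.
  labeled = raw_bytes.dup.force_encoding(declared_charset)

  # Reject input whose bytes do not fit the declared charset. This check also
  # covers UTF-8 input, which encode would otherwise pass through unchecked.
  return nil unless labeled.valid_encoding?

  labeled.encode(Encoding::UTF_8)
rescue ArgumentError, EncodingError
  # Unknown charset name, or a character with no UTF-8 mapping.
  nil
end

latin1 = "caf\xE9".b # "café" as sent by a client using ISO-8859-1
puts to_internal_utf8(latin1, "ISO-8859-1")
```

A caller would then decide what a nil result means for the request, for example rejecting it or falling back to a default charset.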
Charles O Nutter
2006-Sep-26 15:09 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 9/23/06, Sam Ruby <rubys@intertwingly.net> wrote:
> How does JRuby handle strings? If they are mapped to java.lang.String,
> the JRuby already has more than adequate Unicode support.

JRuby does use java.lang.String, but we have to artificially downgrade everything to a single-byte encoding for Ruby's sake. Because there's no concept of characters versus bytes in Ruby, we can't really support multibyte characters or code points or what have you without creating incompatible interfaces.

It's a source of great frustration for us, so much so that we're probably just going to create some incompatibilities to solve the Unicode issue on our end. It's likely that in the future all strings in JRuby will be UTF-16 strings as in Java, and all operations will deal in characters instead of bytes wherever possible. We'll deal with issues that arise as they come up, such as for handling IO that wants byte counts when we're providing character counts.

> It seems to me that .chars should return back the same object, if the
> underlying VM supports Unicode. I would guess that today that would
> include JRuby, and in the future, that would include Ruby 2.0.

chars would be easy to implement today, and really we may look at the ActiveSupport::MultiByte way to handle Unicode as "the one way" we also do it in JRuby. Rails is driving Unicode innovation at this point, so if this sees wider adoption we're not opposed to including it in core JRuby.

To be absolutely clear: we want to support Unicode natively in JRuby, and we're really just looking to the community to decide what form that should take. If there's something that can be done within Ruby 1.8 semantics that works with Ruby 1.8-compatible apps, we'll include it.

--
Contribute to RubySpec! @ www.headius.com/rubyspec
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn
Charles O Nutter
2006-Sep-26 17:02 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 9/20/06, Manfred Stienstra <manfred@gmail.com> wrote:
>
> Three months ago Julian Tarkhanov submitted a test implementation of
> his ActiveSupport::Multibyte string extension patch. Since then we've
> been steadily improving the extension based on the feedback we received.

I'm studying it now... a few notes as I go and thoughts at the end:

- I think we could support Chars natively under JRuby pretty easily, though everything would be UTF-16 internally. However, Java has many, many utilities already present for converting UTF-16 to damn near every encoding under the sun, so this wouldn't be a real limitation. Native JRuby support for MultiByte could potentially be significantly faster than a pure Ruby version, but fully API-equivalent.
- We've been kicking around the possibility of migrating to a mutable UTF-8 string inside JRuby, to avoid the wasted high byte and to get all our mutability in a single type that's friendlier than what Java provides. If we could say that the base JRuby String implementation supports a fast, solid UTF-8 backing store normally, and MultiByte's String#chars and Chars for actual multibyte operations, I think we'd have the best of all worlds.

I like the interface I'm seeing so far. The separation of the base "dumb" String from the "smart" multibyte-aware Chars seems to be the path of least resistance. In my opinion, it's potentially the "right way" long term too... let String remain a dumb byte-box and provide a character-aware type that knows how to do the "right thing" with multibyte encodings.

I also think you guys are going to drive unicode adoption in Ruby for the next year. Matz's m17n is a long way out, and people need unicode now. If something like MultiByte gains serious traction, there's going to be a lot of pressure to support that API in the long term, and there would be little reason we couldn't support it out of the box in core JRuby right now.

--
Charles Oliver Nutter
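The split between a byte-oriented String and a character-aware proxy can be illustrated with a small sketch. CharsSketch below is hypothetical and far smaller than the real Chars class; it is not the ActiveSupport::Multibyte implementation. It decodes UTF-8 with unpack("U*"), which was available in Ruby 1.8 as well. Note that in Ruby 1.9 and later String#length already counts characters, so the sketch uses bytesize to show the byte view.

```ruby
# Sketch only: CharsSketch is a hypothetical, much-reduced stand-in for the
# Chars proxy idea; it is not the ActiveSupport::Multibyte implementation.
class CharsSketch
  def initialize(string)
    @string = string
  end

  # Code points decoded from the UTF-8 bytes of the wrapped string.
  def codepoints
    @string.unpack("U*")
  end

  # Length in characters (code points), not in bytes.
  def length
    codepoints.length
  end

  # Reverse by code point, so multibyte sequences stay intact.
  def reverse
    self.class.new(codepoints.reverse.pack("U*"))
  end

  # First n characters, never cutting a multibyte sequence in half.
  def first(n)
    self.class.new(codepoints.first(n).pack("U*"))
  end

  # Hand back the plain string when byte-level code needs it.
  def to_s
    @string
  end
end

word = "h\u00E9llo"
puts word.bytesize                  # byte view of the plain string
puts CharsSketch.new(word).length   # character view through the proxy
```

Because the proxy is opt-in, code that depends on byte semantics keeps working with the plain string, which is the compatibility argument made in this message.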
Thijs van der Vossen
2006-Sep-26 18:22 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26 Sep 2006, at 17:09 , Charles O Nutter wrote:
> [...] we're probably just going to create some incompatibilities to
> solve the Unicode issue on our end. It's likely that in the future
> all strings in JRuby will be UTF-16 strings as in Java, and all
> operations will deal in characters instead of bytes wherever
> possible. We'll deal with issues that arise as they come up, such
> as for handling IO that wants byte counts when we're providing
> character counts.

Early versions of the unicode_hacks plugin redefined string methods to work on codepoints instead of bytes. This turned out to break a lot of libraries and applications in sometimes subtle but very nasty ways. Patching up IO might work, but suppose you have something like this:

  header('Content-Length', body.length)

Here, length must return the number of bytes and not the number of characters. How can you ever know what to return in this case?

Kind regards,
Thijs
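The Content-Length problem is easy to reproduce. In the Ruby 1.8 of this thread String#length returned bytes; from Ruby 1.9 on it returns characters and String#bytesize returns bytes. The sketch below assumes a current Ruby, and content_length_for is a hypothetical helper, not a Rails method.

```ruby
# Sketch only: content_length_for is a hypothetical helper.
# A Content-Length header counts bytes on the wire, so it must use bytesize.
def content_length_for(body)
  body.bytesize
end

BODY = "na\u00EFve caf\u00E9 \u00BC" # three of these characters need two bytes in UTF-8

puts BODY.length              # characters
puts content_length_for(BODY) # bytes, the value the header needs
```

Keeping the byte count and the character count behind differently named methods is what lets a caller say which one it means.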
Charles O Nutter
2006-Sep-26 18:59 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 9/26/06, Thijs van der Vossen <t.vandervossen@gmail.com> wrote:
> On 26 Sep 2006, at 17:09 , Charles O Nutter wrote:
> > [...] we're probably just going to create some incompatibilities to
> > solve the Unicode issue on our end. It's likely that in the future
> > all strings in JRuby will be UTF-16 strings as in Java, and all
> > operations will deal in characters instead of bytes wherever
> > possible. We'll deal with issues that arise as they come up, such
> > as for handling IO that wants byte counts when we're providing
> > character counts.
>
> Early versions of the unicode_hacks plugin redefined string methods
> to work on codepoints instead of bytes. This turned out to break a
> lot of libraries and applications in sometimes subtle but very nasty
> ways. Patching up IO might work, but suppose you have something like
> this:
>
> header('Content-Length', body.length)
>
> Here, length must return the number of bytes and not the number of
> characters. How can you ever know what to return in this case?

It's for exactly this reason I advocated a separate char sequence type in future Ruby versions, and why I like AS::MB's approach to the problem best so far.

--
Charles Oliver Nutter
Thijs van der Vossen
2006-Sep-26 19:30 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26 Sep 2006, at 19:02 , Charles O Nutter wrote:
> [...] Native JRuby support for MultiByte could potentially be
> significantly
> faster than a pure Ruby version, but fully API-equivalent.

Finally a good reason to run Rails on Java! :-)

> - We've been kicking around the possibility of migrating to a mutable
> UTF-8 string inside JRuby, to avoid the wasted high byte and to get
> all our mutability in a single type that's friendlier than what Java
> provides. If we could say that the base JRuby String implementation
> supports a fast, solid UTF-8 backing store normally and MultiByte's
> String#chars and Chars for actual multibyte operations I think we'd
> have the best of all worlds

Sounds like the way to go to me. UTF-8 is what Ruby has (although limited) built-in support for, and for most Rails apps it's what you have to convert to in the end anyway.

Please note that ActiveSupport::Multibyte supports all Unicode planes and not only the Basic Multilingual Plane. My knowledge of Java is very limited, but judging from the article at [1], working with anything beyond U+FFFF takes some serious effort.

Kind regards,
Thijs

[1] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
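Why characters beyond U+FFFF are awkward in a UTF-16 world can be shown with a single code point. The sketch assumes a current Ruby (1.9 or later) for String#encode; the constant and the utf16_code_units helper are names made up for this illustration.

```ruby
# Sketch only: CLEF and utf16_code_units are names made up for this example.
CLEF = "\u{1D11E}" # MUSICAL SYMBOL G CLEF, a code point outside the BMP

# Number of 16-bit code units the string needs when stored as UTF-16.
def utf16_code_units(string)
  string.encode(Encoding::UTF_16BE).bytesize / 2
end

# The surrogate pair that UTF-16 uses to represent the single code point.
SURROGATES = CLEF.encode(Encoding::UTF_16BE).unpack("n*")

puts CLEF.length            # one character
puts CLEF.bytesize          # four bytes in UTF-8
puts utf16_code_units(CLEF) # two code units in UTF-16
```

Any API that indexes a UTF-16 string by code unit sees two items where there is one character, which is the extra effort the referenced Java article deals with.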
Charles O Nutter
2006-Sep-26 19:51 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 9/26/06, Thijs van der Vossen <t.vandervossen@gmail.com> wrote:
> Please not that ActiveSupport::Multibyte support all Unicode planes
> and not only the Basic Multilingual Plane. My knowledge of Java is
> very limited, but judging from the article at [1] working with
> anything beyond U+FFFF takes some serious effort.

Yes, the limitations are fairly well-known for the "astral plane", but hopefully they'll comprise rare edge cases. On the other hand, our potential UTF-8 string implementation (a gift from Tim Bray) is intended to do full Unicode support "the right way", so it could serve as a general-purpose AS::MB implementation in JRuby as well as our implementation of Ruby's normal String...

--
Charles Oliver Nutter
Pete Yandell
2006-Sep-27 00:27 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 26/09/2006, at 6:06 PM, Michael Koziarski wrote:
> So, if we merged in ActiveSupport::Multibyte, and updated helpers
> like truncate to use the chars proxy, what other changes would be
> required to make this stuff simple? Normalisation of input
> parameters? Anything else?

As I said in an earlier email, the laundry list reads something like:

- Make sure your database character set is utf8 <- this should possibly be checked by Rails
- Make sure all your tables have a character set of utf8 <- this should be done in migrations
- Make sure your database.yml has 'encoding: utf8' set for each database
- Put $KCODE='u' in your environment.rb
- Add 'normalize_unicode_params :form => :kc' to your application.rb

> It would be nice if we could make it really easy to have this stuff
> 'just work' without much in the way of additional user intervention.

I'll sit down next week and write a plugin that does all this (if someone doesn't beat me to it).

Cheers,

Pete Yandell
Michael Koziarski
2006-Sep-27 01:51 UTC
Re: ActiveSupport::Multibyte for better Unicode support
> As I said in an earlier email, the laundry list reads something like:
> - Make sure your database character set is utf8 <- this should
> possibly be checked by Rails
> - Make sure all your tables have a character set of utf8 <- this
> should be done in migrations
> - Make sure your database.yml has 'encoding: utf8' set for each database

We can't change these without the user's intervention, and doing utf-8 with postgres is a little harder than just 'setting the encoding' for the table. Perhaps this is just something we need to include in our documentation?

> - Put $KCODE='u' in your environment.rb

We could update the railties templates, but people will still need to manually update their application.

> - Add 'normalize_unicode_params :form => :kc' to your application.rb

Why do we need this? I can understand the rationale for doing iconv for 'differently encoded' strings, but can't quite follow the justification of normalization.

> I'll sit down next week and write a plugin that does all this (if
> someone doesn't beat me to it).

Sounds good.

--
Cheers

Koz
Manfred Stienstra
2006-Sep-27 06:48 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On Sep 27, 2006, at 2:27 AM, Pete Yandell wrote:
> As I said in an earlier email, the laundry list reads something like:
> - Make sure your database character set is utf8 <- this should
> possibly be checked by Rails

Like someone said before, setting your database encoding to utf-8 is only important if you want to do string operations. Otherwise you can just use the database as a bitbucket and it won't matter. I think this should be the default in railties and not handled by a plugin.

> - Make sure all your tables have a character set of utf8 <- this
> should be done in migrations

The best solution is to set the default encoding of the database when you create it, that way you can't miss a table and you still have the option to override it for certain tables.

> - Make sure your database.yml has 'encoding: utf8' set for each
> database

Again, I think this is a matter of defaults in Rails.

> - Put $KCODE='u' in your environment.rb

This should probably be a default in environment.rb if we want Rails to be completely utf-8.

> - Add 'normalize_unicode_params :form => :kc' to your application.rb

Compatibility normalization should _never_ be a default, because it causes data loss. If there is a default, it should probably be NFC or NFD. I'm still not convinced it's important to normalize all incoming data.

> I'll sit down next week and write a plugin that does all this (if
> someone doesn't beat me to it).

The plugin that defines normalize_unicode_params is called utf8_plugin and it's in the same repository as the rest of the Multibyte for Rails stuff. It was meant as a plugin to do all the utf-8 settings and operations you need to do utf-8 in a Rails application. The plugin is a descendant of unicode_hacks and in the past this also set the database client encoding and the content-type header. We feel this is no longer necessary; we think it's better to solve this with good defaults and documentation.

Thijs van der Vossen is currently writing a series of blog posts on our weblog about which steps you have to take to have a fully utf-8 Rails, we hope to convert this to documentation in the near future.

I am in no way trying to stop you from writing your own plugin, but I hope you don't waste time going down the same route we did.

Manfred
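The data loss caused by compatibility normalization can be illustrated with a short sketch. It assumes a present-day Ruby (2.2 or later) and its built-in String#unicode_normalize, not the normalize_unicode_params plugin method mentioned above; the sample text is made up for the demonstration.

```ruby
# "x²" followed by "ﬁt" written with the U+FB01 "fi" ligature.
original = "x\u00B2 \uFB01t"

nfc  = original.unicode_normalize(:nfc)   # canonical composition
nfkc = original.unicode_normalize(:nfkc)  # compatibility composition

puts "NFC keeps the text as entered:  #{nfc == original}"
puts "NFKC keeps the text as entered: #{nfkc == original}"
puts "NFKC result: #{nfkc}"
```

After NFKC the superscript and the ligature are replaced by plain "2" and "fi", and there is no way to tell from the result what the user originally typed. NFC leaves both characters untouched.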
Pete Yandell
2006-Sep-28 00:15 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 27/09/2006, at 11:51 AM, Michael Koziarski wrote:
>> As I said in an earlier email, the laundry list reads something like:
>> - Make sure your database character set is utf8 <- this should
>> possibly be checked by Rails
>> - Make sure all your tables have a character set of utf8 <- this
>> should be done in migrations
>> - Make sure your database.yml has 'encoding: utf8' set for each
>> database
>
> We can't change these without the user's intervention, and doing utf-8
> with postgres is a little harder than just 'setting the encoding' for
> the table. Perhaps this is just something we need to include in our
> documentation?

We can certainly make sure tables created with migrations have the right character set, and we can at least check and give a warning if the various character sets (database, table, connection, Rails) don't match up.

I don't know what's required for Postgres, but I'll build for MySQL and somebody with Postgres experience can extend from there.

>> - Add 'normalize_unicode_params :form => :kc' to your application.rb
>
> Why do we need this? I can understand the rationale for doing iconv
> for 'differently encoded' strings, but can't quite follow the
> justification of normalization.

Because there are characters in Unicode (for example 'ü') that can be encoded in multiple possible ways (either as a single 'ü' character, or as a 'u' followed by an umlaut modifier), and without normalisation comparisons between them will fail.

This is probably the trickiest area. When and how to normalise is something a developer really needs to think about, so just enabling auto-normalisation of all parameters is possibly not the solution.

Pete Yandell
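The 'ü' example above can be checked directly. The sketch below assumes a present-day Ruby (2.2 or later) with String#unicode_normalize, which did not exist when this thread was written; it is an illustration of the comparison problem, not the plugin's implementation.

```ruby
precomposed = "\u00FC"    # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed  = "u\u0308"   # "u" followed by U+0308 COMBINING DIAERESIS

# Both render as the same letter, but the underlying code points differ.
raw_equal = precomposed == decomposed

# After canonical normalization (NFC or NFD) the two compare equal.
nfc_equal = precomposed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
nfd_equal = precomposed.unicode_normalize(:nfd) == decomposed.unicode_normalize(:nfd)

puts "equal without normalization: #{raw_equal}"
puts "equal after NFC:             #{nfc_equal}"
puts "equal after NFD:             #{nfd_equal}"
```

Either canonical form works for comparison as long as both sides use the same one, which is why the choice of when to normalise matters more than the mechanism.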
Pete Yandell
2006-Sep-28 00:25 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 27/09/2006, at 4:48 PM, Manfred Stienstra wrote:
> On Sep 27, 2006, at 2:27 AM, Pete Yandell wrote:
>> As I said in an earlier email, the laundry list reads something like:
>> - Make sure your database character set is utf8 <- this should
>> possibly be checked by Rails
>
> Like someone said before, setting your database encoding to utf-8 is
> only important if you want to do string operations. Otherwise you can
> just use the database as a bitbucket and it won't matter. I think
> this should be the default in railties and not handled by a
> plugin.

I vehemently disagree! :) If you're storing UTF8 data, you shouldn't have your database think it's latin1. (You can only get away with this at all because MySQL is lazy with checking strings.) Backing up, exporting, or accessing the database through something other than Rails can all give you trouble if you do this.

>> - Make sure all your tables have a character set of utf8 <- this
>> should be done in migrations
>
> The best solution is to set the default encoding of the database when
> you create it, that way you can't miss a table and you still have the
> option to override it for certain tables.

I agree, but it would be nice to have Rails at least say "I think I'm running in utf8 mode, so I'd better make sure the database agrees and warn the developer if it doesn't."

>> - Add 'normalize_unicode_params :form => :kc' to your application.rb
>
> Compatibility normalization should _never_ be a default, because it
> causes data loss. If there is a default, it should probably be NFC or
> NFD. I'm still not convinced it's important to normalize all incoming
> data.

Yep, fair call. I think this is the trickiest point.

> I am in no way trying to stop you from writing your own plugin, but I
> hope you don't waste time going down the same route we did.

Well, if nothing else my plugin will be useful to me. I'm sick of having to go through all the steps required to support Unicode in every app I write, and I've accidentally missed steps before in ways that have caused nasty data corruption and been hard to fix. (Try setting your database character set to utf8, but not setting the connection character set.)

Pete Yandell
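The kind of "do all the layers agree?" check described above might look roughly like the sketch below. Everything in it is hypothetical: the helper name charset_mismatches and the hard-coded values are assumptions for illustration, and a real check would have to obtain the reported character sets from the database and the connection, which is not shown here.

```ruby
# Hypothetical helper: given the character set reported at each layer,
# return the layers that disagree with what the application expects.
def charset_mismatches(expected, reported)
  reported.reject { |_layer, charset| charset.to_s.downcase == expected.downcase }
end

# Made-up values standing in for what each layer would report.
reported = {
  database:   "utf8",
  table:      "utf8",
  connection: "latin1"   # the easy-to-miss one
}

mismatches = charset_mismatches("utf8", reported)
mismatches.each do |layer, charset|
  warn "#{layer} character set is #{charset}, expected utf8"
end
```

Warning instead of silently changing settings matches the concern raised earlier in the thread that Rails cannot alter an existing database without the user's intervention.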
Michael Glaesemann
2006-Sep-28 02:00 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On Sep 28, 2006, at 9:15 , Pete Yandell wrote:
> We can certainly make sure tables created with migrations have the
> right character set, and we can at least check and give a warning if
> the various character sets (database, table, connection, Rails) don't
> match up.
>
> I don't know what's required for Postgres, but I'll build for MySQL
> and somebody with Postgres experience can extend from there.

In PostgreSQL, encoding is a database-level setting, not a table attribute. IIRC, changing from one encoding to another requires dumping the database, passing the dump through iconv, creating a new database with the target encoding, and loading the dump into the new database.

Michael Glaesemann
grzm seespotcode net
Pete Yandell
2006-Sep-28 02:14 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 28/09/2006, at 12:00 PM, Michael Glaesemann wrote:
> On Sep 28, 2006, at 9:15 , Pete Yandell wrote:
>
>> We can certainly make sure tables created with migrations have the
>> right character set, and we can at least check and give a warning if
>> the various character sets (database, table, connection, Rails) don't
>> match up.
>>
>> I don't know what's required for Postgres, but I'll build for MySQL
>> and somebody with Postgres experience can extend from there.
>
> In PostgreSQL, encoding is a database-level setting, not a table
> attribute. IIRC, changing from one encoding to another requires
> dumping the database, passing the dump through iconv, creating a new
> database with the target encoding, and loading the dump into the new
> database.

Yep, which is yet another reason to have UTF8 be the convention for new Rails apps. Re-encoding all the strings in your database is not fun.

Pete Yandell
Thijs van der Vossen
2006-Sep-28 07:04 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 28 Sep 2006, at 02:15 , Pete Yandell wrote:
> We can certainly make sure tables created with migrations have the
> right character set,

If you set the default character set to utf-8 when you create a MySQL or PostgreSQL database, all tables you create using Rails migrations will inherit this character set.

In other words, if you create your MySQL database with:

    > CREATE DATABASE db_name CHARACTER SET utf8 COLLATE utf8_unicode_ci;

or your PostgreSQL database with:

    $ createdb --encoding=UTF8 db_name

all tables will use UTF-8 so there's not really a need to set or check the character set on the tables.

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com
Phone: +31 (0)6 24204845
Skype: tvandervossen
MSN Messenger: thijs@fngtps.com
iChat/AOL: t.vandervossen@mac.com
Jabber IM: thijs@jabber.org
Julian ''Julik'' Tarkhanov
2006-Sep-28 19:01 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 28-sep-2006, at 4:00, Michael Glaesemann wrote:
> In PostgreSQL, encoding is a database-level setting, not a table
> attribute. I

AFAIK it's customizable all the way, from the cluster to the database to the tables and columns. And the locale of the postmaster user plays its part too.

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
Thijs van der Vossen
2006-Sep-28 19:15 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On 28 Sep 2006, at 21:01 , Julian 'Julik' Tarkhanov wrote:
> On 28-sep-2006, at 4:00, Michael Glaesemann wrote:
>> In PostgreSQL, encoding is a database-level setting, not a table
>> attribute. I
> AFAIK it's customizable all the way, from the cluster to the
> database to the tables and columns.

You can set the encoding on multiple levels, but you can only set the locale, which defines the collation, when you create the 'database cluster'. You can get a list of available locales on your system with:

    $ locale -a

> And the locale of the postmaster user plays its part too.

Only if you do not set the locale when you create the 'database cluster'.

The easiest way that we could find to do UTF-8 with PostgreSQL is to first create the 'database cluster' with:

    $ initdb --locale=en_GB.UTF-8 -D data_dir

...and then create the databases with something like:

    $ createdb --encoding=UTF8 db_name

And yes, you've read it right, you can only get _one collation type_ for your cluster...

Kind regards,
Thijs
Michael Glaesemann
2006-Sep-29 02:17 UTC
Re: ActiveSupport::Multibyte for better Unicode support
On Sep 29, 2006, at 4:01 , Julian 'Julik' Tarkhanov wrote:
> On 28-sep-2006, at 4:00, Michael Glaesemann wrote:
>
>> In PostgreSQL, encoding is a database-level setting, not a table
>> attribute. I
> AFAIK it's customizable all the way, from the cluster to the database
> to the tables and columns.
> And the locale of the postmaster user plays its part too.

Could you please point me to where you can specify table or column encodings separate from those of the database? Encoding from the client side is negotiated by the client (so you might be sending Latin-1 to the server and it gets translated to the database encoding) so in some (weak) sense you can handle the data for tables and columns in different encodings *on the client side*, but on the server side, the encoding is fixed for the database at the time of database creation. At the time of initdb, a default encoding can be chosen for the entire cluster, but it can be overridden for individual databases at the time of database creation.

http://www.postgresql.org/docs/8.1/interactive/multibyte.html

Michael Glaesemann
grzm seespotcode net