I posted this to ruby-talk, but it occurred to me that you folks implementing Rails functionality probably have a thing or two to say about unicode support in Ruby. Therefore, I would love to hear your opinions. Adding native unicode support is only a matter of time in JRuby; its usefulness as a JVM-based language depends on it. However, we continue to wrestle with how best to support unicode without stepping on the Ruby community's toes in the process. Thoughts?

---------- Forwarded message ----------
From: Charles O Nutter <headius@headius.com>
Date: Jun 14, 2006 7:11 PM
Subject: Re: Unicode roadmap?
To: ruby-talk ML <ruby-talk@ruby-lang.org>

Every time these unicode discussions come up my head spins like a top. You should see it.

We JRubyists have headaches from the unicode question too. Since JRuby is currently 1.8-compatible, we do not have what most call *native* unicode support. This is primarily because we do not wish to create an incompatible version of Ruby or build in support for unicode now that would conflict with Ruby 2.0 in the future. It is, however, embarrassing to say that although we run on top of Java, which has arguably pretty good unicode support, we don't support unicode. Perhaps you can see our conundrum.

I am no unicode expert. I know that Java uses UTF-16 strings internally, converted to/from the current platform's encoding of choice by default. It also supports converting those UTF-16 strings into just about every encoding out there, just by telling it to do so. Java supports the Unicode specification version 3.0. So Unicode is not a problem for Java.

We would love to be able to support unicode in JRuby, but there's always that nagging question of what it should look like and what would mesh well with the Ruby community at large. With the underlying platform already rich with unicode support, it would not take much effort to modify JRuby. So then there's a simple question:

What form would you, the Ruby users, want unicode to take? Is there a specific library that you feel encompasses a reasonable implementation of unicode support, e.g. icu4r? Should the support be transparent, e.g. no longer treat or assume strings are byte vectors? JRuby, because we use Java's String, is already using UTF-16 strings exclusively...however there's no way to get at them through core Ruby APIs. What would be the most comfortable way to support unicode now, considering where Ruby may go in the future?

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ jruby.sourceforge.net
Application Architect @ www.ventera.com
_______________________________________________
Rails-core mailing list
Rails-core@lists.rubyonrails.org
http://lists.rubyonrails.org/mailman/listinfo/rails-core

On Jun 15, 2006, at 2:19 AM, Charles O Nutter wrote:

> I posted this to ruby-talk, but it occurred to me that you folks
> implementing Rails functionality probably have a thing or two to
> say about unicode support in Ruby. Therefore, I would love to hear
> your opinions. Adding native unicode support is only a matter of
> time in JRuby; its usefulness as a JVM-based language depends on
> it. However, we continue to wrestle with how best to support
> unicode without stepping on the Ruby community's toes in the
> process. Thoughts?

Julik has done a lot of pioneering in that direction for Rails. His latest suggestion is to use a proxy class on string objects to perform unicode operations:

  @some_unicode_string.u.length
  @some_unicode_string.u.reverse

I tend to agree with this solution, as it doesn't break any previous string operations and gives us an easy way to perform unicode-aware operations.

Manfred
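
For readers who have not seen the proxy idea, here is a minimal sketch of how a `.u` accessor could work. It is not Julik's actual code: the class name `UnicodeProxy` and its internals are hypothetical, and it assumes the string holds valid UTF-8. It decodes the bytes into code points with `unpack("U*")`, a call that was already available in Ruby 1.8, and leaves every existing String method untouched.

```ruby
# Hypothetical sketch only: UnicodeProxy is an illustrative name,
# not the API of any released plugin. Assumes valid UTF-8 input.
class UnicodeProxy
  def initialize(string)
    @string = string
  end

  # Decode the UTF-8 bytes into an array of integer code points.
  def codepoints
    @string.unpack("U*")
  end

  # Count code points instead of bytes.
  def length
    codepoints.length
  end

  # Reverse code point by code point so multibyte sequences stay intact.
  def reverse
    codepoints.reverse.pack("U*")
  end
end

class String
  # Opt-in accessor; ordinary String methods keep their old behaviour.
  def u
    UnicodeProxy.new(self)
  end
end

word = "caf\xC3\xA9"     # "café" written as raw UTF-8 bytes
puts word.bytesize       # 5 bytes
puts word.u.length       # 4 code points
puts word.u.reverse      # the same code points in reverse order
```

Reversing by code point is still a simplification: a combining accent would end up detached from its base letter, so a fuller implementation would have to work on grapheme clusters.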

I agree it's a very attractive solution. I have two related questions (perhaps you are out there to answer, Julik):

1. How does performance look with the unicode string add-on versus native strings?
2. Is this the ideal way to support unicode strings in Ruby?

And I explain the second as follows....if we could assume that switching from treating a string as an array of bytes to a list of characters of arbitrary width, and have all existing string operations work correctly treating those characters as string, would that be a better ideal? Where are the breaking points in such a design? What's to stop the underlying implementation from actually using a UTF-16 character, passing UTF-8 to libraries and IO streams but still allowing you to access everything as UTF-16 or your encoding of choice? (Of course this is somewhat rhetorical; we do this currently with JRuby since Java's strings are UTF-16...we just don't have any way to provide access to UTF-16 characters, and we normalize everything to UTF-8 for Ruby's sake...but what if we didn't normalize and adjusted string functions to compensate?)

On 6/14/06, Manfred Stienstra <manfred@gmail.com> wrote:
>
> On Jun 15, 2006, at 2:19 AM, Charles O Nutter wrote:
>
> > I posted this to ruby-talk, but it occurred to me that you folks
> > implementing Rails functionality probably have a thing or two to
> > say about unicode support in Ruby. Therefore, I would love to hear
> > your opinions. Adding native unicode support is only a matter of
> > time in JRuby; its usefulness as a JVM-based language depends on
> > it. However, we continue to wrestle with how best to support
> > unicode without stepping on the Ruby community's toes in the
> > process. Thoughts?
>
> Julik has done a lot of pioneering in that direction for Rails. His
> latest suggestion is to use a proxy class on string objects to
> perform unicode operations:
>
> @some_unicode_string.u.length
> @some_unicode_string.u.reverse
>
> I tend to agree with this solution as it doesn't break any previous
> string operations and gives us an easy way to perform unicode aware
> operations.
>
> Manfred

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ jruby.sourceforge.net
Application Architect @ www.ventera.com
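
The difference between bytes, UTF-16 units, and characters that the second question turns on can be made concrete with a small sketch. It relies on `String#encode`, which only arrived in Ruby 1.9 and so postdates this thread, and it says nothing about how JRuby stores strings internally; the sample text and variable names are made up for illustration.

```ruby
# Illustrative only: the sample text and variable names are assumptions.
# "naïve" plus U+1D11E (musical G clef), which lies outside the BMP.
text = "na\u00EFve \u{1D11E}"

utf8  = text.encode("UTF-8")
utf16 = text.encode("UTF-16BE")

code_points = utf8.unpack("U*").length   # what a reader would call characters
utf8_bytes  = utf8.bytesize              # what a byte-vector string counts
utf16_units = utf16.bytesize / 2         # what a Java-style char[] counts

puts "code points: #{code_points}"       # 7
puts "UTF-8 bytes: #{utf8_bytes}"        # 11
puts "UTF-16 units: #{utf16_units}"      # 8

# Converting back to UTF-8 at a library or IO boundary is lossless.
round_trip = utf16.encode("UTF-8") == utf8
puts "round trip ok: #{round_trip}"      # true
```

The three counts differ because "ï" takes two bytes in UTF-8 and the clef takes four bytes in UTF-8 and a surrogate pair in UTF-16. Any design that indexes strings by UTF-16 unit therefore still has to decide what to do when an index lands in the middle of a surrogate pair.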

On 15-jun-2006, at 3:50, Charles O Nutter wrote:

> I agree it's a very attractive solution. I have two questions
> related (perhaps you are out there to answer, Julik):
>
> 1. How does performance look with the unicode string add-on versus
> native strings?
> 2. Is this the ideal way to support unicode strings in ruby?
>
> And I explain the second as follows....if we could assume that
> switching from treating a string as an array of bytes to a list of
> characters of arbitrary width, and have all existing string
> operations work correctly treating those characters as string,
> would that be a better ideal? Where are the breaking points in such
> a design? What's to stop the underlying implementation from
> actually using a UTF-16 character, passing UTF-8 to libraries and
> IO streams but still allowing you to access everything as UTF-16 or
> your encoding of choice? (Of course this is somewhat rhetorical; we
> do this currently with JRuby since Java's strings are UTF-16...we
> just don't have any way to provide access to UTF-16 characters, and
> we normalize everything to UTF-8 for Ruby's sake...but what if we
> didn't normalize and adjusted string functions to compensate?)

This is more appropriate for ruby-talk.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Fair enough; redirected. If any other rails-core folks want to chime in, please do so...I would expect unicode and multibyte support to be key issues for worldwide Rails deployments.

On 6/14/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
>
> On 15-jun-2006, at 3:50, Charles O Nutter wrote:
>
> > I agree it's a very attractive solution. I have two questions
> > related (perhaps you are out there to answer, Julik):
> >
> > 1. How does performance look with the unicode string add-on versus
> > native strings?
> > 2. Is this the ideal way to support unicode strings in ruby?
> >
> > And I explain the second as follows....if we could assume that
> > switching from treating a string as an array of bytes to a list of
> > characters of arbitrary width, and have all existing string
> > operations work correctly treating those characters as string,
> > would that be a better ideal? Where are the breaking points in such
> > a design? What's to stop the underlying implementation from
> > actually using a UTF-16 character, passing UTF-8 to libraries and
> > IO streams but still allowing you to access everything as UTF-16 or
> > your encoding of choice? (Of course this is somewhat rhetorical; we
> > do this currently with JRuby since Java's strings are UTF-16...we
> > just don't have any way to provide access to UTF-16 characters, and
> > we normalize everything to UTF-8 for Ruby's sake...but what if we
> > didn't normalize and adjusted string functions to compensate?)
>
> This is more appropriate for ruby-talk
>
> --
> Julian 'Julik' Tarkhanov
> please send all personal mail to
> me at julik.nl

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ jruby.sourceforge.net
Application Architect @ www.ventera.com