thr3ads.net - Rails core - Investigating i10n/i18n issues [Dec 2005]

Julian 'Julik' Tarkhanov

2005-Dec-18 19:23 UTC

Investigating i10n/i18n issues

Hello to everyone on the Core.

Recently I promised Joshua Harvey (of Globalize plugin fame) to
investigate the Rails code for possible multibyte issues. It is a pity I
didn't have time to do it sooner, and my findings are sad
(although productive). Not so long ago I filed bug #2103, which
got a prompt fix from Jamis.

The name of the bug reads: truncate() helper is not multibyte-safe

The actual name of it should have been: String#[] method is broken  
for multibyte strings

Yes, this is not a Rails problem. Most of the String methods in Ruby
are not mb-safe, although String implies working with characters
instead of bytes. To fix the bug I filed, Jamis needed to introduce
ALL THAT (http://dev.rubyonrails.org/changeset/2265) for the fix and
the test, including a special "sandbox" mode to test the effects of
the helper. I assume that every time a bug like this
is found, just as many lines are going to be needed (sandboxed test +
code fork at the end-user API). So I investigated how many such places
there might be.

The answer is the following: all of Rails. Take a look, for
example, at this file within ActiveSupport.

http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/core_ext/string/access.rb

Let me tell you, all of this is broken. It's broken in Ruby and it
stays broken in Rails. When you feed these methods multibyte strings
you had better be lucky that your Range covers complete codepoints -
otherwise you invalidate your output for ANY meaningful use (XML,
conversion to another encoding etc.) - you can slice "into" a
character and you will. And there is a very big problem which adds
insult to injury.
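
To make the "slicing into a character" failure concrete, here is a minimal sketch. It assumes a modern Ruby (1.9 or later), where String#[] is already character-aware, so byteslice stands in for the byte-indexed String#[] of the Ruby discussed in this thread. The sample word is only an illustration.

```ruby
# "f\u00FCr" is the German word "fuer" written with a u-umlaut:
# 3 characters, but 4 bytes in UTF-8 (the umlaut takes 2 bytes).
word = "f\u00FCr"

# Take the first two BYTES, the way a byte-indexed String#[] would.
cut = word.byteslice(0, 2)

puts cut.bytes.inspect    # the second byte is only the first half of the umlaut
puts cut.valid_encoding?  # the fragment is no longer well-formed UTF-8
```

Anything downstream that expects valid UTF-8 (an XML parser, a transcoder) can choke on such a fragment.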

  _Most Rails developers will never notice_. Why, you ask? Well,
here's the answer.

By default, Ruby uses UTF-8 for the "unicoded" $KCODE setting. In
UTF-8, all ASCII characters stay single-byte, so you would
never damage them by using "foobar"[0..2]. And you would always get
a correct "reversed" string.
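
A quick way to see which characters are "safe" under byte indexing is to look at their UTF-8 widths. The sketch below assumes a modern Ruby with String#bytesize; the four sample characters are arbitrary picks. Only 7-bit ASCII is single-byte in UTF-8, so even a Latin-1 umlaut already takes two bytes.

```ruby
# Byte widths of single characters in UTF-8.
samples = {
  "a"      => "ASCII letter",
  "\u00FC" => "u-umlaut (Latin-1 range)",
  "\u0416" => "Cyrillic Zhe",
  "\u65E5" => "CJK ideograph"
}

# Collect the widths in insertion order for later inspection.
widths = samples.keys.map { |ch| ch.bytesize }

samples.each do |ch, label|
  puts "#{label}: #{ch.bytesize} byte(s)"
end
```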

But as soon as you pop ONE umlaut in there, as soon as you enter ONE  
character which is not single-byte you introduce an error. Recently,  
I read this entry on the blog of Lucas Carlson.

http://tech.rufy.com/entry/93

Guess WHY he is advising me to use "require 'jcode'"? Because he
never notices that his handling is broken until both:
a) he actually enters a multibyte character into his string
b) this multibyte character happens to sit right at the "slice" of
the Range

Same with ActiveSupport. Essentially, all of the problems
that Ruby has with multibyte handling persist well into
Rails, up to its uppermost layers (such as RJS). Moreover, this is
actually the tip of the iceberg. If we try to discover and file EVERY
bug that appears in Rails with regard to multibyte handling,
hundreds of lines of code will appear to fix the issue at the wrong
level of the stack.

Let's see. To handle Unicode properly in a web app, we actually need
it correctly and transparently handled across the following stack:

[  database ] -- should normalize, store and sort
[  database driver ] -- should set the right client encoding
[  ruby ] -- should operate on strings properly <<BROKEN>>
[  rails ] -- should set the right headers and coordinate input and
output
--------
[ web-server ] -- should not do any implicit reencoding (some do, too
long to explain here)
[ proxy etc. ] -- same as above
[ browser ] -- should display and accept multibyte characters properly


Now the problem is, that fixes such as the one for truncate() are NOT  
the solution, because they fix what has to be fixed in Ruby itself.   
If we look at this part of the stack more closely, we will see  
(pardon my ASCII):

[ Ruby ]
[           Rails
[   [ ActiveSupport]
[     [ AR], [AP], [AWS], .....

Which means that while we are working within Rails, we can always
expect ActiveSupport to be available! Otherwise we wouldn't have
things such as symbolize_keys!, 20.days.from_now etc.

Now, Matz is promising proper multibyte Strings for Ruby 2.0. The
trouble with this is that we never know WHEN it's coming - it has been
promised for years, and the emails on "broken Unicode" in Ruby just
keep coming on ruby.lang.

So instead of reviewing ALL the (already immense) Rails codebase, I  
have a simple question.

We have a number of dependencies. We know that String IS BROKEN and  
it needs to be rewired. We know that most Multibyte-aware code is not  
using String#methods, but Rails does use them. And we know that  
ActiveSupport is implied.

OTOH, we know the following:
a) most Rails developers are not using EUC-JP or JIS
b) the ones that NEED multibyte strings are using UTF-8
c) the ones that THINK they DON'T need them have a BIG problem and need a
slap on their head
d) the regex engine we have now is already much more multibyte-aware
than the String methods
e) jcode.rb does something, but it's NOT enough

The people in (c) have a problem because they stand a chance of being
bitten by the issue as soon as a
First User Types The First Double-Byte Character Into One Of Their
Forms. After that you can expect many, many nasty things to happen.
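
Point (d) can be illustrated with the /u regex flag, which makes "." match one whole UTF-8 character instead of one byte. This is only a sketch with an arbitrary sample word; it is written to run on a modern Ruby, where the same flag still exists.

```ruby
text = "na\u00EFve"   # "naive" with i-diaeresis: 5 characters, 6 bytes

# u = treat the pattern as UTF-8, m = let "." match newlines as well.
chars = text.scan(/./mu)

puts chars.length   # number of characters
puts text.bytesize  # number of bytes

# Character-safe prefix: at most 3 whole characters from the start.
prefix = text[/\A.{0,3}/mu]
puts prefix
```

The regex route never cuts a character in half, because the engine advances by whole characters rather than by bytes.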

And on the other hand we have the Unicode gem. While Ruby 2.0 is far
from finished, we already have Unicode-aware case conversions,
Unicode-aware normalization and decomposition. All of these can be
easily wired into the String class itself to provide _out_of_the_box_
fixes to multibyte issues for EVERYONE who:

a) has the Unicode gem (I don't know how to get it running on Win32)
b) uses UTF8 as his KCODE (which right now is a Rails requirement for
using multibyte strings)
c) is running with ActiveSupport loaded
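
For readers who want to see what "Unicode-aware case conversion" and "decomposition" mean in practice, here is a small sketch. It does not use the Unicode gem mentioned above (its API is not shown in this thread). Instead, as a stand-in, it assumes a modern Ruby (2.4 or later) whose core String methods provide comparable operations.

```ruby
word = "\u00FCber"   # German "ueber" written with a u-umlaut

# ASCII-only case mapping: the umlaut is left untouched.
ascii_only = word.upcase(:ascii)

# Full Unicode case mapping: the umlaut is upcased as well.
unicode = word.upcase

# Canonical decomposition (NFD): the umlaut becomes "u" + combining diaeresis.
decomposed = "\u00FC".unicode_normalize(:nfd)

puts ascii_only
puts unicode
puts decomposed.codepoints.inspect
```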

This can also be made optional (with an ActiveSupport::use_utf8() pragma-like
statement). For people using EUC-JP and other Kanji systems we
really have to step out of the way (I don't have any understanding of
their languages to make judgements, but I suspect that most of what
they might need from a Rails app is supported with UTF-8 - it would
just require transcoding because of the enormous amount of other
Kanji data already in the wild).

It is really that simple. Some 60 lines of String rewiring get you
very far: they free you from slicing characters, they get you normal
reverse() and index() mechanics etc. But this is not "really" the
province of ActiveSupport, because it overrides and rewires a substantial
CORE language feature. And if one would say "it's nasty to override
the core language" I would agree - but not in the case of Rails.
Currently, a rewired String class would provide _exactly_ the same
functionality as the default String class outlined for Ruby 2.0 by Matz
(character oriented vs. byte oriented - and that's how it works now
for ASCII).
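
As a rough idea of what such rewiring could look like, here is a sketch of character-oriented length, slice, reverse and index built on codepoint arrays. The module name Utf8Sketch and its method names are hypothetical. The sketch is not the code proposed in this message, and it deliberately avoids reopening String so that it stays self-contained.

```ruby
# Hypothetical sketch: character-oriented helpers for UTF-8 strings.
module Utf8Sketch
  module_function

  # Decode the UTF-8 bytes into an array of integer codepoints.
  def codepoints(str)
    str.unpack("U*")
  end

  # Length in characters, not bytes.
  def length(str)
    codepoints(str).length
  end

  # Slice by character positions; an out-of-range slice gives "".
  def slice(str, range)
    (codepoints(str)[range] || []).pack("U*")
  end

  # Reverse whole characters instead of individual bytes.
  def reverse(str)
    codepoints(str).reverse.pack("U*")
  end

  # Character index of the first occurrence of needle, or nil.
  def index(str, needle)
    hay = codepoints(str)
    pin = codepoints(needle)
    (0..hay.length - pin.length).find { |i| hay[i, pin.length] == pin }
  end
end

word = "f\u00FCr"   # 3 characters, 4 bytes
puts Utf8Sketch.length(word)
puts Utf8Sketch.slice(word, 0..1)
puts Utf8Sketch.reverse(word)
puts Utf8Sketch.index(word, "r")
```

Going through codepoint arrays is simple but allocates a lot, so the performance cost of any real implementation would have to be measured.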

So the question is quite simple.

Is this a viable path? Fix String for UTF-8 users once and for all  
and get a substantial part of Rails to be multibyte-safe actually  
_for free_, or go on, sticking our heads in the sand, finding bugs in  
Rails itself and (temporarily) healing the symptoms instead of the  
malady?

This brings in another issue of Unicode support. The Python and Perl  
ways of doing it are to distinguish between a "bytestring" and a  
"unicode string". This is a way of the apocalypse. It implies that  
every developer, in every function, in every subroutine and every  
block call must explicitly cast one into the other (because you never  
can be sure which one you are getting). MovableType circumvents this  
by processing ALL as bytestrings (doing the unpack+pack voodoo to  
shake "off" the UTF flag), other packages do other things - but the  
problem STICKS, because all of the developers prefer to output  
"normal" bytestrings and get them in as well. Which has led me to a  
simple realisation:

* * * *  As long as multibyte support is optional, nobody gives a  
sh..t if it works.

Let's take a simple example. Someone makes a helper that truncates
the excerpt of the entry automatically to N characters. Let's ask
ourselves: if he wanted to do it properly, would he look into the
library "ActiveSupport" which would add "safe_truncate" to String, or
would he just call string[0..len]? What would you do?

ActiveSupport is a very good and vast Ruby extension module. Why
couldn't we add something _really_ important to it instead of
syntactic sugar only? Something that really many people need?
Something that would fix all the stack UNDER the Rails components so
that nobody even has to THINK about bugs like #2103, if only because
of the ignorance of the developer (like in the post
by Lucas I've linked to)?

What do you think? Please note that I am heavily biased because every
single piece of software I used since I was 12 had problems with
Russian letters, and Rails is no exception 10 years later, on a fully
Unicode-capable Unix box. If the core language has to be bent INTO
shape (I call this "into" rather than "out of") to make things Just
Work, why not?

--
Julian 'Julik' Tarkhanov
me at julik.nl

Jean-Christophe Michel

2005-Dec-19 21:10 UTC

Re: Investigating i10n/i18n issues

Julian 'Julik' Tarkhanov wrote:
> What do you think? Please note that I am heavily biased because every
> single piece of software I used since I was 12 had problems with
> Russian letters, and Rails is no exception 10 years later, on a fully
> Unicode-capable Unix box. If the core language has to be bent INTO
> shape (I call this "into" rather than "out of") to make things Just
> Work, why not?
+1
As a European I would be pleased to have an error free unicode layer :-)
I left Php hoping ruby was more advanced on this side...

-- 
Jean-Christophe Michel

Thijs Van Der Vossen

2005-Dec-20 08:04 UTC

Re: Investigating i10n/i18n issues

On 19 Dec 2005, at 22:10, Jean-Christophe Michel wrote:
> +1
> As a European I would be pleased to have an error free unicode  
> layer :-)
> I left Php hoping ruby was more advanced on this side...
In Ruby you can store utf-8 encoded text in strings, use regexes on
utf-8 encoded strings and convert between different encodings using
the iconv library. If I'm not mistaken, this is basically the same as
what you can do in PHP.

If you _need_ a dynamic language with a true and tested Unicode  
String type _right now_ you might want to take a look at Python. ;-)

Kind regards,
Thijs

Joshua Harvey

2005-Dec-20 16:04 UTC

Re: Investigating i10n/i18n issues

I think Julik brought up a very important issue, and I wish it had
gotten more attention. Ruby's Unicode string handling is broken,
mostly because it doesn't count multibyte characters correctly.

Thijs Van Der Vossen wrote:
> If you _need_ a dynamic language with a true and tested Unicode
> String type _right now_ you might want to take a look at Python. ;-)
Well, Julik did have a look at Python: "The Python and Perl
ways of doing it are to distinguish between a 'bytestring' and a
'unicode string.' This is a way of the apocalypse."

More importantly, though, why should we defer to other languages and
frameworks? We love Rails, we love Ruby, and by making a small change
in the String class we'll have best-in-class Unicode support.

Add Globalize into the mix and you open up huge possibilities. Typo
with out-of-the-box support for dozens of languages, including
localized date display. Instiki with built in multi-language support,
so that the rails wiki could be easily translated into dozens of
languages. Ecommerce sites that are actually useful outside the US and
UK.

Because of the power and flexibility of Ruby and Rails, we can add
this elusive i18n stuff pretty easily. Why not do it? It's a
make-or-break feature for millions of people.

_______________________________________________
Rails-core mailing list
Rails-core@lists.rubyonrails.org
http://lists.rubyonrails.org/mailman/listinfo/rails-core

Obie Fernandez

2005-Dec-20 16:22 UTC

Re: Investigating i10n/i18n issues

+1

I don't think I understand the hesitation.

obie


Wilson Bilkovich

2005-Dec-20 16:27 UTC

Re: Investigating i10n/i18n issues

I didn't want to be the first reply, because I'm not part of core, and
my support doesn't mean much in the grand scheme of things.
That being said, I think the ability to 'fix this' at the framework
layer is one of the beautiful parts of Ruby, and we should just go
ahead and do it.
I'd be happy to contribute code, or tests on Win32, etc, etc.
I hate having a whole universe of text data I don't 'trust' in Ruby.

--Wilson.


Julian 'Julik' Tarkhanov

2005-Dec-20 16:45 UTC

Re: Investigating i10n/i18n issues

On 20-dec-2005, at 9:04, Thijs Van Der Vossen wrote:
> In Ruby you can store utf-8 encoded text in strings, use regexes on
> utf-8 encoded strings and convert between different encodings using
> the iconv library. If I'm not mistaken, this is basically the same
> as what you can do in PHP.
>
> If you _need_ a dynamic language with a true and tested Unicode
> String type _right now_ you might want to take a look at Python. ;-)
Thijs, it sucks in Python too, because it's explicit and optional.
Please read my message more thoroughly.


--
Julian 'Julik' Tarkhanov
me at julik.nl

Thijs Van Der Vossen

2005-Dec-20 20:40 UTC

Re: Investigating i10n/i18n issues

On 20 Dec 2005, at 17:45, Julian 'Julik' Tarkhanov wrote:
> Thijs, it sucks in Python too, because it's explicit and optional.
> Please read my message more thoroughly.
Hi Julian, I _did_ read your message thoroughly and I think the
changes you propose are an excellent way to fix the problem in Rails.

I don't think I fully agree with you on the apocalypse part, but I do
see the problem and I do think your proposal is the best way to
make it 'just work' in Rails without breaking anything.

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com
+31 (0)6 24204845
thijs@jabber.org

David Heinemeier Hansson

2005-Dec-25 18:59 UTC

Re: Investigating i10n/i18n issues

> Is this a viable path? Fix String for UTF-8 users once and for all
> and get a substantial part of Rails to be multibyte-safe actually
> _for free_, or go on, sticking our heads in the sand, finding bugs in
> Rails itself and (temporarily) healing the symptoms instead of the
> malady?
I don't think anyone would be against having better UTF-8 support for
free. The problem in the past has just been that free wasn't so.
Usually, it would be that it killed performance. So we can't really
say yes or no before we have an implementation that's real and where
we can weigh the cons versus the pros.

So. Please do go ahead and make a fixed String in Active Support. Then
examine all the cons. Like do some serious benchmarking on real apps
with and without the fix. Consider how this would break backwards
compatibility. Then write it all up in an email to this list. If the
case is persuasive, I will not stand in the way for its inclusion.
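
As a starting point for that kind of measurement, here is a deliberately tiny micro-benchmark sketch. It is no substitute for benchmarking real apps: the helper name time_it, the iteration count and the sample text are all made up for illustration, and the codepoint-based slice is just one possible way to implement a character-safe slice.

```ruby
# Hypothetical helper: run the block `iterations` times, return seconds elapsed.
def time_it(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# A Russian word (6 Cyrillic letters plus a space), repeated to get a longer text.
text = "\u041F\u0440\u0438\u0432\u0435\u0442 " * 200

# Byte-oriented slice: cheap, but may cut a character in half.
byte_time = time_it(2_000) { text.byteslice(0, 30) }

# Character-safe slice via a codepoint array: decodes the whole string first.
char_time = time_it(2_000) { text.unpack("U*")[0, 30].pack("U*") }

puts format("byte slice:      %.4fs", byte_time)
puts format("codepoint slice: %.4fs", char_time)
```

The absolute numbers depend on the machine and the Ruby version, so only the ratio between the two lines is worth looking at.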

Also, please do dig into the ruby-talk archives to find older
discussions on this subject. I believe it has been discussed
extensively in the past and you might be able to find some good
arguments that can help the implementation.

Best of luck!

BTW, I believe you're in a uniquely qualified position to do this
work. Simply because you want it the most :). That has always been the
most powerful motivator in open source. Right on!
--
David Heinemeier Hansson
http://www.loudthinking.com -- Broadcasting Brain
http://www.basecamphq.com   -- Online project management
http://www.backpackit.com   -- Personal information manager
http://www.rubyonrails.com  -- Web-application framework
