Hello to everyone on the Core.

Recently I promised Joshua Harvey (of Globalize plugin fame) to investigate the Rails code for possible multibyte issues. It's a pity that I didn't have time to do it quickly, but my findings are sad (although productive).

Not so long ago I filed bug #2103, which got a prompt fix from Jamis. The name of the bug reads:

  truncate() helper is not multibyte-safe

The actual name of it should have been:

  String#[] method is broken for multibyte strings

Yes, this is not a Rails problem. Most of the String methods in Ruby are not multibyte-safe, although String implies working with characters instead of bytes. To fix the bug I filed, Jamis needed to introduce ALL THAT (http://dev.rubyonrails.org/changeset/2265) for the fix and the test, including a special "sandbox" mode to test the effects of the helper.

I assume that for every situation where a bug like this is found, just as many lines are going to be needed (sandboxed test + code fork at the end-user API). So I investigated how many such places there might be. The answer is: all of Rails.

Take a look, for example, at this file within ActiveSupport:

http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/core_ext/string/access.rb

Let me tell you, all of this is broken. It's broken in Ruby and it stays broken in Rails. When you feed these methods multibyte strings, you had better be lucky that your Range covers complete codepoints - otherwise you invalidate your output for ANY meaningful use (XML, conversion to another encoding etc.) - you can slice "into" a character, and you will.

And there is a very big problem which adds insult to injury. _Most Rails developers will never notice_. Why, you ask? Well, here's the answer. By default, Ruby uses UTF-8 for the "unicoded" $KCODE setting. In UTF-8, all ASCII characters (the unaccented Latin letters, digits and punctuation) stay single-byte, so you would never damage them by using "foobar"[0..2]. And you would always get a correct "reversed" string.
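The byte-versus-character behaviour described above can be reproduced with a minimal sketch. It assumes Ruby 2.0 or later, where String#[] is already character-aware, so the byte-oriented indexing of the Ruby 1.8 discussed in this thread is imitated with String#byteslice. The helper names slice_bytes and reverse_bytes are illustrative only and are not part of Ruby or Rails.

```ruby
# Assumption: Ruby 2.0+. These hypothetical helpers imitate how Ruby 1.8's
# String#[] and String#reverse treated a string as a plain run of bytes.

# Slice by raw byte offsets, ignoring character boundaries.
def slice_bytes(str, range)
  str.byteslice(range)
end

# Reverse byte by byte, ignoring character boundaries.
def reverse_bytes(str)
  str.bytes.reverse.pack("C*").force_encoding(str.encoding)
end

ascii  = "foobar"
umlaut = "für Elise" # the "ü" occupies two bytes in UTF-8

# ASCII text: one byte per character, so a byte range never cuts a character.
puts slice_bytes(ascii, 0..2)                  # foo
puts slice_bytes(ascii, 0..2).valid_encoding?  # true

# The byte range 0..1 ends in the middle of "ü".
broken = slice_bytes(umlaut, 0..1)
puts broken.bytes.inspect                      # [102, 195]
puts broken.valid_encoding?                    # false

# A byte-wise reverse swaps the two bytes of "ü" and corrupts it as well.
puts reverse_bytes(umlaut).valid_encoding?     # false
```

The invalid results are exactly the kind of output that later breaks XML serialization or conversion to another encoding.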
But as soon as you pop ONE umlaut in there, as soon as you enter ONE character which is not single-byte, you introduce an error.

Recently, I read this entry on the blog of Lucas Carlson:

http://tech.rufy.com/entry/93

Guess WHY he is advising me to use "require 'jcode'"? Because he never notices that his handling is broken until both of these happen:

a) he actually enters a multibyte character into his string
b) this multibyte character happens to sit right at the "slice" of the Range

Same with ActiveSupport. Essentially, all of the problems that Ruby has with regard to multibyte handling persist well into Rails, up to its uppermost layers (such as RJS).

Moreover, this is actually the tip of the iceberg. If we try to discover and file EVERY bug that appears in Rails with regard to multibyte handling, hundreds of lines of code will appear to fix the issue at the wrong level of the stack.

Let's see. To handle Unicode properly in a web app, we actually need it correctly and transparently handled across the following stack:

[ database ]        -- should normalize, store and sort
[ database driver ] -- should set the right client encoding
[ ruby ]            -- should operate on strings properly <<BROKEN>>
[ rails ]           -- should set the right headers and coordinate input and output
--------
[ web-server ]      -- should not do any implicit reencoding (some do, too long to explain here)
[ proxy etc. ]      -- same as above
[ browser ]         -- should display and accept multibyte characters properly

Now the problem is that fixes such as the one for truncate() are NOT the solution, because they fix what has to be fixed in Ruby itself. If we look at this part of the stack more closely, we will see (pardon my ASCII):

[ Ruby ]
[ Rails
  [ [ ActiveSupport]
    [ [ AR], [AP], [AWS], .....

Which means that while we are working within Rails, we can always expect ActiveSupport to be available! Otherwise we wouldn't have things such as symbolize_keys!, 20.days.from_now etc.

Now, Matz is promising proper multibyte Strings for Ruby 2.0. The trouble with this is that we never know WHEN it's coming: it has been promised for years, and the emails on "broken Unicode" in Ruby just keep coming on ruby.lang.

So instead of reviewing ALL of the (already immense) Rails codebase, I have a simple question.

We have a number of dependencies. We know that String IS BROKEN and needs to be rewired. We know that most multibyte-aware code is not using the String methods, but Rails does use them. And we know that ActiveSupport is implied.

OTOH, we know the following:

a) most Rails developers are not using EUC-JP or JIS
b) the ones that NEED multibyte strings are using UTF-8
c) the ones that THINK they DON'T need them have a BIG problem and need a slap on the head
d) the regex engine we have now is already much more multibyte-aware than the String methods
e) jcode.rb does something, but it's NOT enough

The developers in (c) have a big problem because they stand a chance of being bitten by the issue as soon as a First User Types The First Double-Byte Character Into One Of Their Forms. After that you can expect many, many nasties to happen.

And on the other hand we have the Unicode gem. While Ruby 2.0 is far from finished, we already have Unicode-aware case conversions, Unicode-aware normalization and decomposition.
All of these can be easily wired into the String class itself to provide _out_of_the_box_ fixes to multibyte issues for EVERYONE who:

a) has the Unicode gem (I don't know how to get it running on Win32)
b) uses UTF8 as his KCODE (which right now is a Rails requirement for using multibyte strings)
c) is running with ActiveSupport loaded

This can also be made optional (with a pragma-like statement such as ActiveSupport::use_utf8()).

For people using EUC-JP and other Kanji systems we really have to step out of the way (I don't have any understanding of their languages to make judgements, but I suspect that most of what they might need from a Rails app is supported with UTF-8 - it would just require transcoding because of the enormous amount of other Kanji data already in the wild).

It is really that simple. Some 60 lines of String rewiring get you very far: they free you from slicing characters, they get you normal reverse() and index() mechanics etc. But this is not "really" the pie of ActiveSupport, because it overrides and rewires a substantial CORE language feature. And if one would say "it's nasty to override the core language" I would agree - but not in the case of Rails. Currently, a rewired String class would provide _exactly_ the same functionality as the default String class outlined for Ruby 2.0 by Matz (character-oriented vs. byte-oriented - and that's how it works now for ASCII).

So the question is quite simple. Is this a viable path? Fix String for UTF-8 users once and for all and get a substantial part of Rails to be multibyte-safe actually _for free_, or go on, sticking our heads in the sand, finding bugs in Rails itself and (temporarily) healing the symptoms instead of the malady?

This brings in another issue of Unicode support. The Python and Perl ways of doing it are to distinguish between a "bytestring" and a "unicode string". This is a way of the apocalypse.
It implies that every developer, in every function, in every subroutine and every block call must explicitly cast one into the other (because you never can be sure which one you are getting). MovableType circumvents this by processing ALL strings as bytestrings (doing the unpack+pack voodoo to shake "off" the UTF flag), other packages do other things - but the problem STICKS, because all of the developers prefer to output "normal" bytestrings and get them in as well.

Which has led me to a simple realisation:

* * * * As long as multibyte support is optional, nobody gives a sh..t if it works.

Let's take a simple example. Someone makes a helper that automatically truncates the excerpt of an entry to N characters. Let's ask ourselves: if he wanted to do it properly, would he look into the "ActiveSupport" library, which would add "safe_truncate" to String, or would he just call string[0..len]? What would you do?

ActiveSupport is a very good and vast Ruby extension module. Why couldn't we add something _really_ important to it instead of syntactic sugar only? Something that really many people need? Something that would fix the whole stack UNDER the Rails components so that nobody even has to THINK about bugs like #2103, if only because of plain developer ignorance (as in the post by Lucas I've linked to)?

What do you think? Please note that I am heavily biased because every single piece of software I used since I was 12 had problems with Russian letters, and Rails is no exception 10 years later, on a fully Unicode-capable Unix box. If the core language has to be bent INTO shape (I call this "into" rather than "out of") to make things Just Work, why not?

--
Julian 'Julik' Tarkhanov
me at julik.nl

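The character-oriented slicing, reversing and truncating proposed in the message above can be sketched roughly as follows. This is not the code from the thread or from Rails: the module name Utf8Sketch and its methods are hypothetical, and the sketch uses standalone functions instead of rewiring String. It relies only on the "U*" directive of String#unpack and Array#pack, which decodes UTF-8 into integer code points and back. It works per code point, so combining sequences are not kept together, and unpack raises ArgumentError on malformed UTF-8.

```ruby
# Hypothetical helpers for illustration; the names are not part of Ruby,
# Rails or ActiveSupport.
module Utf8Sketch
  module_function

  # Decode a UTF-8 string into an array of integer code points.
  def codepoints(str)
    str.unpack("U*")
  end

  # Slice by code points instead of bytes. Returns "" when the range
  # lies completely outside the string.
  def slice(str, range)
    (codepoints(str)[range] || []).pack("U*")
  end

  # Reverse code point by code point, so multibyte characters stay intact.
  def reverse(str)
    codepoints(str).reverse.pack("U*")
  end

  # Keep at most `length` code points and append `omission` if anything
  # was cut off. Unlike a real helper, the omission is not counted
  # against `length`.
  def truncate(str, length, omission = "...")
    points = codepoints(str)
    return str if points.length <= length
    points[0, length].pack("U*") + omission
  end
end

text = "Привет" # six Cyrillic letters, twelve bytes in UTF-8
puts Utf8Sketch.slice(text, 0..2)   # При
puts Utf8Sketch.reverse(text)       # тевирП
puts Utf8Sketch.truncate(text, 3)   # При...
```

Wiring the same logic into String itself, as the proposal suggests, would make every existing caller of string[0..len] safe without changing the call site; that is the trade-off under discussion.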
Julian 'Julik' Tarkhanov wrote:
> What do you think? Please note that I am heavily biased because every
> single piece of software I used since I was 12 had problems with
> Russian letters, and Rails is no exception 10 years later, on a fully
> Unicode-capable Unix box. If the core language has to be bent INTO
> shape (I call this "into" rather than "out of") to make things Just
> Work, why not?

+1
As a European I would be pleased to have an error-free Unicode layer :-)
I left PHP hoping Ruby was more advanced on this side...

--
Jean-Christophe Michel

On 19 Dec 2005, at 22:10, Jean-Christophe Michel wrote:
> +1
> As a European I would be pleased to have an error-free Unicode
> layer :-)
> I left PHP hoping Ruby was more advanced on this side...

In Ruby you can store utf-8 encoded text in strings, use regexes on utf-8 encoded strings and convert between different encodings using the iconv library. If I'm not mistaken, this is basically the same as what you can do in PHP.

If you _need_ a dynamic language with a true and tested Unicode String type _right now_ you might want to take a look at Python. ;-)

Kind regards,
Thijs

I think Julik brought up a very important issue, and I wish it had gotten more attention. Ruby's Unicode string handling is broken, mostly because it doesn't count multibyte characters correctly.

Thijs Van Der Vossen wrote:
> If you _need_ a dynamic language with a true and tested Unicode
> String type _right now_ you might want to take a look at Python. ;-)

Well, Julik did have a look at Python: "The Python and Perl ways of doing it are to distinguish between a 'bytestring' and a 'unicode string.' This is a way of the apocalypse."

More importantly, though, why should we defer to other languages and frameworks? We love Rails, we love Ruby, and by making a small change in the String class we'll have best-in-class Unicode support.

Add Globalize into the mix and you open up huge possibilities. Typo with out-of-the-box support for dozens of languages, including localized date display. Instiki with built-in multi-language support, so that the Rails wiki could be easily translated into dozens of languages. Ecommerce sites that are actually useful outside the US and UK.

Because of the power and flexibility of Ruby and Rails, we can add this elusive i18n stuff pretty easily. Why not do it? It's a make-or-break feature for millions of people.

_______________________________________________
Rails-core mailing list
Rails-core@lists.rubyonrails.org
http://lists.rubyonrails.org/mailman/listinfo/rails-core

+1

I don't think I understand the hesitation.

obie

On 12/20/05, Joshua Harvey <jmharvey.19309139@bloglines.com> wrote:
> I think Julik brought up a very important issue, and I wish it had
> gotten more attention. [...]

I didn't want to be the first reply, because I'm not part of core, and my support doesn't mean much in the grand scheme of things.

That being said, I think the ability to 'fix this' at the framework layer is one of the beautiful parts of Ruby, and we should just go ahead and do it. I'd be happy to contribute code, or tests on Win32, etc. I hate having a whole universe of text data I don't 'trust' in Ruby.

--Wilson.

On 12/20/05, Obie Fernandez <obiefernandez@gmail.com> wrote:
> +1
>
> I don't think I understand the hesitation.
>
> obie

On 20-dec-2005, at 9:04, Thijs Van Der Vossen wrote:
> In Ruby you can store utf-8 encoded text in strings, use regexes on
> utf-8 encoded strings and convert between different encodings using
> the iconv library. If I'm not mistaken, this is basically the same
> as what you can do in PHP.
>
> If you _need_ a dynamic language with a true and tested Unicode
> String type _right now_ you might want to take a look at Python. ;-)

Thijs, it sucks in Python too, because it's explicit and optional. Please read my message more thoroughly.

--
Julian 'Julik' Tarkhanov
me at julik.nl

On 20 Dec 2005, at 17:45, Julian 'Julik' Tarkhanov wrote:
> Thijs, it sucks in Python too, because it's explicit and optional.
> Please read my message more thoroughly.

Hi Julian,

I _did_ read your message thoroughly and I think the changes you propose are an excellent way to fix the problem in Rails.

I don't think I fully agree with you on the apocalypse part, but I do see the problem and I do think your proposal is the best way to make it 'just work' in Rails without breaking anything.

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com
+31 (0)6 24204845
thijs@jabber.org

> Is this a viable path? Fix String for UTF-8 users once and for all
> and get a substantial part of Rails to be multibyte-safe actually
> _for free_, or go on, sticking our heads in the sand, finding bugs in
> Rails itself and (temporarily) healing the symptoms instead of the
> malady?

I don't think anyone would be against having better UTF-8 support for free. The problem in the past has just been that free wasn't so. Usually, it would be that it killed performance. So we can't really say yes or no before we have an implementation that's real and where we can weigh the cons versus the pros.

So. Please do go ahead and make a fixed String in Active Support. Then examine all the cons. Like do some serious benchmarking on real apps with and without the fix. Consider how this would break backwards compatibility. Then write it all up in an email to this list. If the case is persuasive, I will not stand in the way of its inclusion.

Also, please do dig into the ruby-talk archives to find older discussions on this subject. I believe it has been discussed extensively in the past and you might be able to find some good arguments that can help the implementation.

Best of luck! BTW, I believe you're in a uniquely qualified position to do this work. Simply because you want it the most :). That has always been the most powerful motivator in open source. Right on!

--
David Heinemeier Hansson
http://www.loudthinking.com -- Broadcasting Brain
http://www.basecamphq.com -- Online project management
http://www.backpackit.com -- Personal information manager
http://www.rubyonrails.com -- Web-application framework
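The performance concern raised in this reply can be probed with a small timing sketch. It is only a micro-benchmark of one operation, not the benchmarking on real apps that the message asks for. It assumes Ruby 2.1 or later for Process.clock_gettime; the helper time_it is hypothetical, and the code-point slice uses the unpack/pack round trip as a stand-in for whatever a rewired String would do.

```ruby
# Hypothetical helper: run the block `iterations` times and return the
# elapsed wall-clock time in seconds.
def time_it(iterations)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

sample     = "Привет, мир! " * 100 # multibyte text, repeated for some bulk
iterations = 10_000

# Byte-oriented slice: cheap, but may cut a character in half.
byte_time = time_it(iterations) { sample.byteslice(0, 30) }

# Code-point slice: decodes the whole string before slicing.
char_time = time_it(iterations) { sample.unpack("U*")[0, 30].pack("U*") }

printf("byte slice:       %.4fs\n", byte_time)
printf("code-point slice: %.4fs\n", char_time)
```

The absolute numbers depend on the machine and the Ruby version; the point of such a measurement is the ratio between the two approaches on representative strings.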