Julian 'Julik' Tarkhanov
2005-Dec-21 14:53 UTC
Investigating Unicode. Take 2, with nastities and allegations.
Well, I see that my last email hasn't generated any reaction from the Rails core team. It looks like all of them are the happy users of "plain text" (which, as we know by now, doesn't exist, but still).

I apologize in advance for the sore bitterness of this message, but I see that Rails-core STILL, despite all of the efforts, sees these issues as something you can YAGNI away, something "optional", "additional" or "plugin-able".

What I will try to prove in this message is that it's not "additional" - and more, it's got poisonous teeth and it bites painfully. You can forgive Matz, because he has to stay above the controversy and cater to the Japanese and Chinese users, and he dislikes Unicode (like most of the enlightened Japanese do). But you can't forgive _yourself_ because these are _your_ applications. As a developer, you are accountable.

In my first email I was talking about the low-level mechanics of this stuff. They are interesting for a Ruby internals developer or a deep-down Ruby hacker (like Jamis), but I haven't touched the consequences that Rails gets from this (because I thought I wouldn't need to draw out this knife if there was interest). Turns out we (as per David and others) are still in the cozy world of "Plain Text" though, so it's time I opened this can of worms.

Let's skip for a second all these nasty, disgusting "question mark in a rectangle" characters your users will see when you truncate their text improperly - this is, after all, the temple of Output, the browser domain - you sent it out, and then the browser has to cope with it according to Postel's law. And besides there are not many of them, right? Just some lousy 5.5 billion potential customers, right? Uhm, sorry, got a little off-topic here.

Let's move somewhat up the stack from my previous message, into a different domain - the one you care about. The one you foster and cherish. The domain of Data.
Paul Battley gave a good talk at the recent EuRuKo conference in Munich about Unicode in Ruby. Among his other slides he had "Doing mischief with Unicode". Unfortunately I couldn't attend because EuRuKo effectively was on my birthday, so I found other fish to fry on that day - but you can find Paul's presentation in its gory MPEG4 here:

http://www.futurometer.com/320x240x15fps/Battley.Unicode.mov.gz

I will merely expand his presentation into Rails - that's right, we will exploit Rails with Unicode. Let's say you are storing your data in Unicode (because if you don't you must spend the rest of your days in Hell writing Sanskrit in octets on a concrete plate with a dinner fork). You think your bases are covered and you did require 'jcode'. Except that 'jcode' won't help.

Let's have a look at this nice little snippet.

    class User < ActiveRecord::Base
      validates_presence_of :login
    end

Looks bulletproof, doesn't it? If a user enters spaces into the form they are going to get String#strip'ped, and then the text in the field is going to be String#blank?, right? So entering all spaces into the Login field won't work, right?

Well, it will. The Unicode standard, as of now, comprises 26 (!) characters which can be considered "whitespace". 26, that is, when used inside a string - at the boundaries it gets to 27 (including the zero-width no-break space AKA BOM).

So let's try:

    kinda_lovely_login = [
      0x0020, # White_Space # Zs SPACE
      0x00A0, # White_Space # Zs NO-BREAK SPACE
      0x202F, # White_Space # Zs NARROW NO-BREAK SPACE
    ].pack("U*")

And lo and behold...

    User.new(:login=>kinda_lovely_login).save!

Nice, isn't it? If you wonder - yes, this is an exploit existing in YOUR Rails application RIGHT NOW (albeit a mild one). That one application that is sooo-web 2.0, with Ajax and stuff.
If you like it, you better switch to 7-bit ASCII right away before selling it to anyone (not that you will be successful unless you only sell to British and American customers, and as we all know, the Web ends there). And "just using UTF-8" won't help, because Unicode is hard.

You wonder WHY that happens? Well... String#strip is Unicode-unaware. As are String#empty? and (thusly) String#blank? But don't reach out for your fixtures just yet! Because I'm far from finished...

Let's move on:

    class User < ActiveRecord::Base
      validates_size_of :name, :maximum=>5
    end

Ok, this is our User. Now let's see if I can use this application:

    my_name = [1070, 1083, 1080, 1082].pack("U*")

In case you wonder - this is my name in Russian, spelled "Юлик". The one my mother gave to me.

    User.new(:login=>'julik', :name=>my_name).save!

    /usr/local/lib/ruby/gems/1.8/gems/activerecord-1.13.2/lib/active_record/validations.rb:711:in `save!': ActiveRecord::RecordInvalid (ActiveRecord::RecordInvalid)

Ahem, wait a minute. You said it was 5, right? And of course you show it to me in a nice little error message? But I gather that my name is as many as 4 letters, and it fits the boundaries quite nicely. Well, no. String#size is not Unicode-aware, as we know - so AR just sticks to that. And my name turns out to be quite a bit longer than what I thought it might be:

    my_name.size
    => 8

Well, sure, two bytes per character. David can stick some of his nice Danish diacritics in there as well, because they ought to be double-byte too. And yes, the fact that UTF-8 stores plain ASCII one byte per character will nicely conceal this from you as long as you stay in your cozy "plain-text" land. If you like it THAT way you better stick the following into the form:

"The length of your name decomposed into bytes should be less than, or equal to 5".

I bet your users will love that.

Now just do a grep on Rails sources for string.size (and friends). Enjoy the mess.

This is not "localization of dates and times", gentlemen, this is serious BAD.
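A minimal sketch of the byte count versus character count mismatch described above, assuming the string holds valid UTF-8. The helper names (byte_length, char_length, within_maximum?) are made up for illustration and are not part of Ruby or Rails. The sketch relies only on pack/unpack, so it does not depend on what String#size happens to count in a given Ruby version:

```ruby
# Hypothetical helpers, not part of Ruby or Rails.

# Number of bytes in the UTF-8 encoded string.
def byte_length(utf8_string)
  utf8_string.unpack("C*").size
end

# Number of Unicode code points in the UTF-8 encoded string.
def char_length(utf8_string)
  utf8_string.unpack("U*").size
end

# What a character-based :maximum check would have to compare against.
def within_maximum?(utf8_string, maximum)
  char_length(utf8_string) <= maximum
end

my_name = [1070, 1083, 1080, 1082].pack("U*")  # the name from the message
puts byte_length(my_name)         # 8
puts char_length(my_name)         # 4
puts within_maximum?(my_name, 5)  # true
```

Counting code points is still only an approximation of what a user sees as "letters": a base letter followed by a combining accent is two code points.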
And if you still think these things are not serious and Rails can stay plain text, if you still think this can be outsourced and YAGNI'ed away, if you think it doesn't "touch me because most of my customers are American anyways", if you think you can sell THIS to the pointy-haired bosses, or if you think Matz (and other Japanese) will take care of it for you -- I admire you. Keep countin' 'em bytes.

--
Julian 'Julik' Tarkhanov
me at julik.nl
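For the all-whitespace login described in the message above, a Unicode-aware blank check could look roughly like this. It is a sketch under assumptions: the table lists the code points that carry the White_Space property in recent versions of the Unicode PropList.txt (the exact set has changed between Unicode versions, so verify it against the version you target), and unicode_blank? is a hypothetical helper name, not a Ruby or Rails method.

```ruby
# Code points with the Unicode White_Space property (assumed from recent
# versions of PropList.txt; check the Unicode version you target).
UNICODE_WHITESPACE = [
  (0x0009..0x000D).to_a, 0x0020, 0x0085, 0x00A0, 0x1680,
  (0x2000..0x200A).to_a, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000
].flatten

# Hypothetical helper: true when the UTF-8 string is empty or contains
# nothing but whitespace code points from the table above.
def unicode_blank?(utf8_string)
  utf8_string.unpack("U*").all? { |cp| UNICODE_WHITESPACE.include?(cp) }
end

login = [0x0020, 0x00A0, 0x202F].pack("U*")  # the login from the message
puts unicode_blank?(login)    # true
puts unicode_blank?("julik")  # false
```

A presence validation built on a check like this would reject the login that the byte-oriented strip lets through.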
Thijs Van Der Vossen
2005-Dec-22 19:36 UTC
Re: Investigating Unicode. Take 2, with nastities and allegations.
On 21 Dec 2005, at 15:53, Julian 'Julik' Tarkhanov wrote:
> Well, I see that my last email hasn't generated any reaction from
> the Rails core team. [...]

Julian, maybe I've missed it, but do you have a patch for the String fix you proposed in your previous email? I really like to test our current apps against your proposed solution.

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com
+31 (0)6 24204845
thijs@jabber.org
Kyle Maxwell
2005-Dec-22 22:13 UTC
Re: Investigating Unicode. Take 2, with nastities and allegations.
On 12/21/05, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> Well, I see that my last email hasn't generated any reaction from the
> Rails core team. [...]

Julian,

I think that everyone is with you about wanting great Unicode support in Ruby. However, in the run-up to 1.0, all of the core team guys put in a massive effort to get the release out the door. I imagine that they need some recovery time. Also, there's the holiday season, and many people are spending time with friends and family.

Great Unicode support will happen sooner or later, and if you want sooner, you should start working on a patch. I'd love to contribute, but I need to get through the holidays and a major product launch in January first.

--
Kyle Maxwell
Chief Technologist
E Factor Media // FN Interactive
kyle@efactormedia.com
1-866-263-3261
Julian 'Julik' Tarkhanov
2005-Dec-22 22:47 UTC
Re: Investigating Unicode. Take 2, with nastities and allegations.
On 22-dec-2005, at 20:36, Thijs Van Der Vossen wrote:
> On 21 Dec 2005, at 15:53 , Julian 'Julik' Tarkhanov wrote:
>> Well, I see that my last email hasn't generated any reaction from
>> the Rails core team. [...]
>
> Julian, maybe I've missed it, but do you have a patch for the
> String fix you proposed in your previous email? I really like to
> test our current apps against your proposed solution.

I sent it to you off-list yesterday, I believe. I am working on this:

http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/

If someone wants to help out with the hacking, I will gladly accept it. Just grab the Unicode gem, export the plugin, rake. It has some other code (some of which is addressed in the core already - like the DB connection charset - but funny as it may seem, this was protecting me from the effects of the infamous database timeout problem). But I need more solid test coverage, and not all methods are shadowed yet. Unfortunately there is no test for the core Ruby string functionality, so I can't check if I break it for anyone else. If such a test exists I would like to know where (is Rubicon still viable? it hasn't been updated for quite some time).

Right now I just intercept the String methods that need UTF-8 semantics, and only when $KCODE is UTF8. And you need to have the gem, which means that this won't work for Windows people - they will need to find out how to build the gem themselves, I am C-illiterate. But it overrides the core Ruby class and core Ruby methods. It is, in general, a very nasty hack - a very deep one. I stand by it (and I use it daily), but I don't know if it will work for others.

I just felt very, uhm... upset when I found out that Rails basically does nothing to make up for what is (IMO) Matz's hesitation. There is a similar ambiguity in PHP, but every moderately large application (or framework) at least tries to tackle it through the use of mbstring.
I might hack on this further, but I would like to know the position of the core on this. Because if you want Rails apps to be Unicode-enabled you basically have 3 options:

1) Hack the String - Matz will not produce something working in the near future. Or maybe the Pragmatic guys can convince him, because the purism of "not doing anything so as not to hurt anybody" is noble, but it drags on and has bad side effects. I could find talks about Unicode in Ruby going as far back as 2002, and still absolutely niente has been done to address it at the language level.

2) Fork, fork, fork. Every single string truncation, length calculation or stripping within Rails has to be forked (like the truncate() helper).

3) Make an extension of String which will accommodate hacks like mine under their own prefix, as if we were in PHP-land calling mb_ functions. Again, an enormous code review process would have to ensue, and it gives us no guarantee of covering other outside libraries (or, for that matter, no guarantee that a Rails core developer from the USA won't forget that you need a prefix to count these darn letters right).

I am just upset because it's so broken and I seem to be the only one whining and asking questions. Maybe I am asking them wrong, I don't know. Or I seem to be the only Rails user needing to use both an ß and a Ш in a single string, while everyone else is happily building this new Web 2.0 (which as it turns out has problems accepting my first and last name).

Enjoy the holidays everyone!

--
Julian 'Julik' Tarkhanov
me at julik.nl
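Option 3 in the message above (Unicode-aware operations kept apart from the core String methods) could be sketched as a small wrapper object. This is not the unicode_hacks plugin and not an existing Ruby or Rails API: the class name Utf8Chars and its methods are hypothetical, it assumes valid UTF-8 input and a non-negative limit, and it works on code points only.

```ruby
# Hypothetical wrapper class; the name and methods are illustrative only.
class Utf8Chars
  def initialize(utf8_string)
    @codepoints = utf8_string.unpack("U*")
  end

  # Length in code points, not bytes.
  def length
    @codepoints.size
  end

  # First `limit` code points re-encoded as UTF-8; never cuts inside a
  # multibyte sequence.
  def truncate(limit)
    @codepoints[0, limit].pack("U*")
  end

  # Code points in reverse order, re-encoded as UTF-8.
  def reverse
    @codepoints.reverse.pack("U*")
  end
end

mixed = [0x00DF, 0x0428].pack("U*")  # U+00DF (sharp s), then U+0428 (Cyrillic Sha)
chars = Utf8Chars.new(mixed)
puts chars.length                            # 2
puts chars.truncate(1).unpack("U*").inspect  # [223]
puts chars.reverse.unpack("U*").inspect      # [1064, 223]
```

The trade-off is the one the message names: callers have to remember to go through the wrapper, and nothing stops library code from calling the byte-oriented String methods directly.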
Jamis Buck
2005-Dec-23 00:55 UTC
Re: Investigating Unicode. Take 2, with nastities and allegations.
On Dec 22, 2005, at 3:47 PM, Julian 'Julik' Tarkhanov wrote:
> I am just upset because it's so broken and I seem to be the only
> one whining and asking questions. Maybe I am asking them wrong, I
> don't know. Or I seem to be the only Rails user needing to use
> both an ß and a Ш in a single string, while everyone else is
> happily building this new Web 2.0 (which as it turns out has
> problems accepting my first and last name).

Julik,

Allow me, as a core team member, to say, "the core team cares about this issue." I hope that assuages some of your pain.

Now, as a core team member, allow me to say, "the core team has no experience with i18n". Allow me also to say, "the core team has no pressing needs for extensive i18n in their applications." And lastly, allow me to say (as has been said multiple times), "patches are always welcome."

I apologize if I've come off snarky here, but no one likes to be called insensitive. And members of the core team HAVE addressed this issue, repeatedly, and on this very list. Our universal answer is "if someone comes up with a good solution, we'll consider it." I'm sorry if that's not the kind of answer you want to hear, but I can promise you that the core team will not just go away for a month and come back with an i18n solution that everyone loves. Mostly because most of us have never done i18n before, and are therefore not best qualified to come up with a solution.

Please, please, please, work on this. Please, please, please come up with a solution and get the other people on this list (and elsewhere) who need i18n to buy off on it. And then, please, please, please post a patch. That is the only way it's going to happen.

- Jamis
Thijs Van Der Vossen
2005-Dec-23 08:07 UTC
Re: Investigating Unicode. Take 2, with nastities and allegations.
On 23 Dec 2005, at 01:55, Jamis Buck wrote:
> On Dec 22, 2005, at 3:47 PM, Julian 'Julik' Tarkhanov wrote:
>> I am just upset because it's so broken and I seem to be the only
>> one whining and asking questions. Maybe I am asking them wrong, I
>> don't know. Or I seem to be the only Rails user needing to use
>> both an ß and a Ш in a single string, while everyone else is
>> happily building this new Web 2.0 (which as it turns out has
>> problems accepting my first and last name).
>
> Julik,
>
> Allow me, as a core team member, to say, "the core team cares about
> this issue." I hope that assuages some of your pain.
>
> Now, as a core team member, allow me to say, "the core team has no
> experience with i18n". Allow me also to say, "the core team has no
> pressing needs for extensive i18n in their applications." And
> lastly, allow me to say (as has been said multiple times), "patches
> are always welcome."

Just to try clarifying the issue: Julik is _not_ whining about _extensive_ i18n at all, he is whining because Rails breaks in all kinds of subtle ways when you enter Unicode data that contains characters beyond the 'Basic Latin' block.

Simply put, a _character_ is no longer _one byte long_ when you get beyond the characters you can see printed on your keyboard. Even simple punctuation like these “double quotation marks” and stuff like ⾦ take up _three bytes_ each in UTF-8.

Because most string handling stuff in Ruby treats each character as one byte, there are a lot of places in Rails right now where _every_ character is assumed to be _one byte_ in length, which simply is not the case.
Everyone interested might like to read the following articles on how Unicode and the UTF-8 encoding work:

http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
http://www.joelonsoftware.com/articles/Unicode.html

Kind regards,
Thijs van der Vossen

--
Fingertips - http://www.fngtps.com
+31 (0)6 24204845
thijs@jabber.org
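To see how many bytes a given character takes in UTF-8, something like the following can be run. It is only a sketch: utf8_width is a hypothetical helper name, and the last code point is an assumption about which CJK character was meant in the message above (taken here to be U+2FA6).

```ruby
# Hypothetical helper: number of bytes needed to encode one code point in UTF-8.
def utf8_width(codepoint)
  [codepoint].pack("U").unpack("C*").size
end

samples = [
  0x0041,  # LATIN CAPITAL LETTER A
  0x00E9,  # LATIN SMALL LETTER E WITH ACUTE
  0x042E,  # CYRILLIC CAPITAL LETTER YU
  0x201C,  # LEFT DOUBLE QUOTATION MARK
  0x2FA6   # assumed to be the CJK character quoted in the message
]

samples.each do |cp|
  puts format("U+%04X: %d byte(s)", cp, utf8_width(cp))
end
# U+0041: 1 byte(s)
# U+00E9: 2 byte(s)
# U+042E: 2 byte(s)
# U+201C: 3 byte(s)
# U+2FA6: 3 byte(s)
```

In UTF-8, code points up to U+007F take one byte, up to U+07FF two, up to U+FFFF three, and everything above that four.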
Jean-Christophe Michel
2005-Dec-23 08:46 UTC
Re: Investigating Unicode. Take 2, with nastities and allegations.
Hi Julian,

Julian 'Julik' Tarkhanov wrote:
> Well, I see that my last email hasn't generated any reaction from the
> Rails core team. It looks like all of them are the happy users of
> "plain text" (which, as we know by now, doesn't exist, but still).
...

May I ask a Ruby-ignorant question? How does the Ruby project work? Is there no way to fix the String class directly in Ruby by contributing a patch? We could borrow code from PHP's mbstring or from Python to see how UTF-8 is unpacked.

--
Jean-Christophe Michel
Michael Koziarski
2005-Dec-23 10:10 UTC
Re: Investigating Unicode. Take 2, with nastities and allegations.
> Simply put, a _character_ is no longer _one byte long_ when you get
> beyond the characters you can see printed on your keyboard. Even
> simple punctuation like these “double quotation marks” and stuff
> like ⾦ take up _three bytes_ each in UTF-8.

The problem with UTF-8 is that the length of characters varies. So something like this:

    a_string[434..2443]

is no longer O(1). This is why things are often stored with UCS-2 internally, and converted at the boundaries. I believe this is how the JVM handles things, but I could be completely wrong.

But Jamis' point is a valid one. I think one of the key reasons that Rails has been successful is that we haven't just gone mad adding features left, right and center. Everything which gets in is taken from an application where it's been proven. In other frameworks where this hasn't happened you get annoying bugs, and sub-par APIs.

i18n is something I care about, but it's not something I need for my paid work. I think the ideal way to get it into core is for people who are experts *and* need it in their paid work to produce a plugin. Then once the plugin has been in use by the community, we can roll it in. I18n is extremely important, i18n needs to end up in the core distribution. But we need to do it the 'rails way'.

--
Cheers

Koz
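The cost mentioned above comes from locating a character: with a variable-width encoding, the byte offset of character N cannot be computed from N alone, so the string has to be scanned. A minimal sketch, assuming valid UTF-8 input (byte_offset_of_char is a hypothetical helper, not a Ruby method):

```ruby
# Hypothetical helper: byte offset at which the character with the given
# zero-based index starts, or nil when the string has fewer characters.
# It has to walk the bytes, so the cost grows with the index.
def byte_offset_of_char(utf8_string, char_index)
  seen = 0
  utf8_string.unpack("C*").each_with_index do |byte, offset|
    next if (byte & 0xC0) == 0x80  # continuation byte (10xxxxxx), not a start
    return offset if seen == char_index
    seen += 1
  end
  nil
end

text = [0x0041, 0x042E, 0x0042].pack("U*")  # "A", Cyrillic Yu (2 bytes), "B"
puts byte_offset_of_char(text, 0)          # 0
puts byte_offset_of_char(text, 1)          # 1
puts byte_offset_of_char(text, 2)          # 3
puts byte_offset_of_char(text, 3).inspect  # nil
```

A fixed-width internal representation avoids the scan at the price of converting at the boundaries, which is the trade-off described in the message.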
Thijs Van Der Vossen
2005-Dec-23 10:49 UTC
Re: Investigating Unicode. Take 2, with nastities and allegations.
On 23 Dec 2005, at 11:10, Michael Koziarski wrote:
>> Simply put, a _character_ is no longer _one byte long_ when you get
>> beyond the characters you can see printed on your keyboard. Even
>> simple punctuation like these “double quotation marks” and stuff
>> like ⾦ take up _three bytes_ each in UTF-8.
>
> The problem with UTF-8 is that the length of characters varies. So
> something like this:
>
> a_string[434..2443]
>
> is no longer O(1). This is why things are often stored with UCS-2
> internally, and converted at the boundaries. I believe this is how
> the JVM handles things, but I could be completely wrong.

You're right, this is how the String class in Java stores Unicode data internally. The problem with UCS-2 is that it only allows you to encode the 'Basic Multilingual Plane' because you can only use 16 bits for each character. Don't confuse UCS-2 with UTF-16, where each character can take up 2 or 4 bytes. See http://en.wikipedia.org/wiki/UCS-2 for more on this.

The reason we are talking about UTF-8 is that everyone is already using this encoding in their Rails apps and that it allows you to handle ASCII data without ever thinking about it. See http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF for why UTF-8 might actually be a good idea.

> But Jamis' point is a valid one. I think one of the key reasons that
> Rails has been successful is that we haven't just gone mad adding
> features left, right and center. Everything which gets in is taken
> from an application where it's been proven. In other frameworks
> where this hasn't happened you get annoying bugs, and sub-par APIs.

This is a valid point, but it does not apply to this issue. Rails is currently annoyingly buggy when you need to handle Unicode data.

> i18n is something I care about, but it's not something I need for my
> paid work. I think the ideal way to get it into core is for people
> who are experts *and* need it in their paid work to produce a plugin.

I think Julian might be our expert and he's currently working on a solution. Please see his previous email for details.

> Then once the plugin has been in use by the community, we can roll it
> in. I18n is extremely important, i18n needs to end up in the core
> distribution. But we need to do it the 'rails way'.

Although you can't have proper i18n without good Unicode support, good Unicode support is _not_ about i18n. Even if your app will never ever handle anything but English text, you still need to handle stuff like punctuation in text your users are copying and pasting from Word.

Please, please, don't ignore this issue because David said that i18n should be handled at the application level.

Kind regards,
Thijs van der Vossen

--
Fingertips - http://www.fngtps.com
+31 (0)6 24204845
thijs@jabber.org
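The UCS-2 versus UTF-16 distinction above can be illustrated with the surrogate-pair arithmetic that UTF-16 uses for code points beyond the Basic Multilingual Plane. This is a sketch only: utf16_code_units is a hypothetical helper, and it assumes a valid code point (it does not reject the reserved surrogate range U+D800 to U+DFFF).

```ruby
# Hypothetical helper: the 16-bit code units UTF-16 uses for one code point.
# Code points inside the Basic Multilingual Plane fit in one unit; anything
# above U+FFFF is split into a high and a low surrogate.
def utf16_code_units(codepoint)
  return [codepoint] if codepoint < 0x10000
  offset = codepoint - 0x10000
  [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
end

# U+0428 (Cyrillic Sha) is in the BMP: one unit, 2 bytes.
puts utf16_code_units(0x0428).map { |u| format("%04X", u) }.join(" ")   # 0428
# U+1D11E (musical G clef) is outside the BMP: two units, 4 bytes.
puts utf16_code_units(0x1D11E).map { |u| format("%04X", u) }.join(" ")  # D834 DD1E
```

Plain UCS-2 has no such escape mechanism, which is why it cannot represent the second example at all.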
Jamis Buck
2005-Dec-23 14:23 UTC
Re: Investigating Unicode. Take 2, with nastities and allegations.
On Dec 23, 2005, at 3:49 AM, Thijs Van Der Vossen wrote:
> Please, please, don't ignore this issue because David said that
> i18n should be handled at the application level.

One more time, and then I sign out of this thread for good:

The rails team is NOT ignoring this issue. Rather, the rails team is waiting for someone with i18n chops to come up with a decent solution. If that person is you, then we're waiting for you to fix the problem. If that person is not you, then you're in the same boat we are.

- Jamis
Julian 'Julik' Tarkhanov
2005-Dec-23 19:21 UTC
Re: Investigating Unicode. Take 2, practical
On 23-dec-2005, at 15:23, Jamis Buck wrote:
> On Dec 23, 2005, at 3:49 AM, Thijs Van Der Vossen wrote:
>
>> Please, please, don't ignore this issue because David said that
>> i18n should be handled at the application level.
>
> One more time, and then I sign out of this thread for good:
>
> The rails team is NOT ignoring this issue. Rather, the rails team
> is waiting for someone with i18n chops to come up with a decent
> solution. If that person is you, then we're waiting for you to fix
> the problem. If that person is not you, then you're in the same
> boat we are.

That person is the collective conscience :-)

Jamis, thanks for chiming in. I don't know if it's understandable - I just wanted someone to say that it's indeed broken (I got nasty because I wanted to show that it's also broken in Basecamp et al.)

Let's start simply: why don't we assume $KCODE = 'UTF-8' for the test environment in Rails, so that you don't need to sandbox every Rails test that has to do with multibyte characters (i.e. embrace the fact that most of the text in the world is multibyte)? So that you can type wonky literals right in the test cases and quickly see which stuff breaks, without starting an extra interpreter every time you want to truncate a string?

This would be step 0 in the right direction. I am not sure, but I assume all 37's apps run with that setting.

--
Julian 'Julik' Tarkhanov
me at julik.nl
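The "step 0" proposed above might look something like this at the top of a test helper. It is a sketch under assumptions: the version guard is an addition so that the same file also loads on Ruby versions where $KCODE no longer has any effect, and truncate_chars is a hypothetical stand-in for whatever helper is under test, not the Rails truncate() helper.

```ruby
# Assumed to sit at the top of a test helper file. $KCODE only matters on
# Ruby 1.8, so the guard skips it on later versions.
$KCODE = 'UTF8' if RUBY_VERSION < '1.9'

# Hypothetical stand-in for a helper under test: keep the first `limit`
# code points of a UTF-8 string.
def truncate_chars(utf8_string, limit)
  utf8_string.unpack("U*")[0, limit].pack("U*")
end

# A multibyte fixture; with the setting above it could also be typed
# directly as a literal in the test file.
name = [1070, 1083, 1080, 1082].pack("U*")

truncated = truncate_chars(name, 2)
puts truncated.unpack("U*").inspect  # [1070, 1083]
```

A byte-oriented truncation to two units would instead return only the first letter, and truncating to an odd number of bytes would cut a character in half, which is exactly the kind of breakage such test cases are meant to surface.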