All,

As some of you know, Jeroen has added support for Unicode strings in the unstable development version of FOX (version 1.5). I'm trying to plan ahead to decide how best to support this for FXRuby 1.6, but I don't really know anything about Ruby's support for Unicode or i18n in general. If you're familiar with this topic (how/if Ruby deals with Unicode strings) I'd appreciate some pointers.

Thanks,

Lyle
olivers@mondrian-ide.com
2005-Sep-02 19:44 UTC
[fxruby-users] Unicode support in FXRuby 1.6
Did you get any information on this? I decided to look around a bit since I had to learn a bit of Unicode for work the other day. I didn't find out much, but here it is in the interest of getting a discussion going:

    require 'jcode'            # Japanese character support module
    $KCODE = 'u'               # tells Ruby to use the UTF-8 character set
    utf_string = "\xc2\xa9"    # UTF-8 bytes for the copyright sign
    utf_string.length          # -> 2, just a byte count
    utf_string.jlength         # -> 1, the number of UTF-8 characters

That's all I could get to happen. So you can store Unicode strings with the '\x' escape code, but you have to type in the UTF or Japanese encoded bytes manually (no u'Unicode String' like in Python) and Ruby doesn't really know the difference. There are some string utilities in the jcode module, and jcode also alludes to a PATTERN_UTF8 option which allows you to use Regexps with Unicode, but it wasn't defined for me. There is also an 'iconv' module which allows you to convert between character sets, but it is just a wrapper around a unix utility and is not available for me on Windows XP.

Oliver
> That's all I could get to happen. So you can store Unicode strings with
> the '\x' escape code, but you have to type in the UTF or Japanese encoded
> bytes manually (no u'Unicode String' like in Python) and ruby doesn't
> really know the difference.

Can't you store the Ruby scripts as UTF-8 encoded text files? Or will the Ruby interpreter struggle with that?

> There are some string utilities in the jcode
> module, and jcode also alludes to a PATTERN_UTF8 option which allows you
> to use Regexps with Unicode but it wasn't defined for me.

> There is also
> an 'iconv' module which allows you to convert between character sets, but
> it is just a wrapper around a unix utility and is not available for me on
> Windows XP.

FOX has several text codecs already built in.

Sander
> > That's all I could get to happen. So you can store Unicode strings with
> > the '\x' escape code, but you have to type in the UTF or Japanese encoded
> > bytes manually (no u'Unicode String' like in Python) and ruby doesn't
> > really know the difference.
>
> Can't you store the ruby scripts as UTF8 encoded text files? Or will the ruby
> interpreter struggle with that?

I think there is a way to do that, since there's a -K option to the ruby interpreter which allows you to specify UTF-8, EUC or Shift-JIS, but when I tried it, it choked on the non-ASCII character regardless (I just saved a simple Ruby file in UTF-8 format with Notepad and ran 'ruby -Ku test.rb'). I guess the $KCODE system variable is supposed to do this also.

I just noticed an argument to Regexp.new which allows you to choose from the same charset options. I can use a UTF-8 two-byte character in a character class and it works as expected.

Oliver
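The character-class behaviour Oliver describes can be sketched roughly as follows. This is only an illustration: the constant and method names are made up for the example, and it uses the literal /u flag rather than the extra argument to Regexp.new.

```ruby
# Hypothetical names, for illustration only.
COPYRIGHT = "\xc2\xa9"   # U+00A9 written as UTF-8 bytes
N_TILDE   = "\xc3\xb1"   # U+00F1 written as UTF-8 bytes

# With /u, each multi-byte sequence inside the class is treated as one
# character instead of as two separate bytes.
SPECIALS = /[#{COPYRIGHT}#{N_TILDE}]/u

# Count how many of the special characters occur in text.
def count_specials(text)
  text.scan(SPECIALS).length
end

puts count_specials("espa\xc3\xb1ol \xc2\xa9 2005")   # prints 2
```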
That's pretty much correct. Ruby's Unicode support is somewhat weak compared to Python or Perl. Only UTF-8 is supported. No support for UTF-16 is available, afaik.

Basically... here's everything you wanted to know about Ruby's Unicode but were afraid to ask....

* $KCODE can be set to support an encoding directly, but this is *NOT* needed to have a script work with Unicode. It is just a simple shortcut so that any regex like /./ will do the right thing.

* Without $KCODE, regexp with Unicode support is available. It is done using the /u language option, like
      t =~ //u
  or
      Regexp.new(regex, options, 'u')
  (or, alternatively, //m which is for multi-byte -- meaning ANSI, UTF-8, EUC, or SJIS depending on what $KCODE is set to, albeit I believe this is now no longer needed as setting $KCODE will already adjust all regexes).

* Supporting u"" like Python can be added to some extent very easily. See:
      http://redhanded.hobix.com/inspect/closingInOnUnicodeWithJcode.html
  This allows you to then do:
      c = u'U+00a9'   # same as \xc2\xa9

* You can also use:
      [].pack('U*')
      "".unpack('U*')
  to pack/unpack UTF-8 strings. This allows you to easily count characters and iterate through them, without the need for jcode (which really is only needed for getting succ to work).

* jcode.rb is kind of a Ruby hack and it is incomplete. Methods such as reverse, capitalize, casecmp, swapcase, all the strip functions and probably others are not defined and will return incorrect results, depending on the language.

* Ruby's $KCODE does not add a UTF-8 <-> Latin1 encoding conversion, unlike Python's unicode strings. So, albeit with the above, you can do:
      question = u'U+00bfHabla espaU+00f1ol?'   # ¿Habla español?
      puts question
  similar to Python's:
      question = u'\u00bfHabla espa\u00f1ol?'   # ¿Habla español?
      print question
  You will not get the corresponding Latin1 string when you print it (unlike Python's unicode strings).

* To properly do the above, and convert Latin1 <-> UTF-8 for printing, you should use iconv.
      ruby -riconv -e 'puts Iconv.iconv("UTF-8", "ISO-8859-1", "\xf1")'
  Iconv, by default, does *NOT* get installed by the One-Click Windows installer, even though it is supposed to be a standard part of Ruby.
  Adding something then like:
      require 'iconv'
      class UString
        def to_s
          # Iconv.iconv takes (to, from, *strings) and returns an array,
          # so convert the UTF-8 contents to Latin1 and return the string.
          Iconv.iconv("ISO-8859-1", "UTF-8", self).first
        end
      end
  will do the trick for Why's UString class.

* The Ruby interpreter should have no problem reading a UTF-8 .rb script file, but you have to run it with:
      ruby -Ku file.rb
  (or set RUBYOPT to -Ku, so Ruby always runs with that).
  Note, however, that Windows Notepad, when saving UTF-8 files, adds a valid albeit meaningless 3-byte BOM (byte-order mark) at the start, which does not work with Ruby 1.8 (and will also break Unix shebang lines on most, perhaps all, Unixes). UTF-8 does not need this sequence, although the standard allows it. Ruby, just like Unix shebang handling, does not deal with it appropriately.
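The pack/unpack approach from the list above can be sketched like this. The method names are hypothetical, and the Latin1 conversion is not a general replacement for iconv: it only works because every Latin1 byte value is also its Unicode code point, so it covers just that one pair of encodings.

```ruby
# Hypothetical helper names; a sketch, not part of Ruby or FXRuby.

# Count characters in a UTF-8 string by unpacking it into code points.
def utf8_length(str)
  str.unpack('U*').length
end

# Latin1 -> UTF-8: read raw bytes, write them back as code points.
def latin1_to_utf8(str)
  str.unpack('C*').pack('U*')
end

# UTF-8 -> Latin1: only possible for code points up to U+00FF.
def utf8_to_latin1(str)
  codepoints = str.unpack('U*')
  if codepoints.any? { |cp| cp > 0xff }
    raise ArgumentError, "character outside the Latin1 range"
  end
  codepoints.pack('C*')
end

copyright = "\xc2\xa9"
puts utf8_length(copyright)                        # prints 1
puts latin1_to_utf8("\xf1").unpack('C*').inspect   # prints [195, 177]
```

Iterating over characters works the same way: str.unpack('U*').each yields one code point per character.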
Oh yeah... the plan for Ruby 2.0 (or 1.9?) Unicode is described here:
    http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

So what does this mean for FXRuby? Well, it means that Unicode support could probably be implemented in one of two ways:

a) By using FXRuby's own FXString, which would do all the encoding and which would support a constructor like:

    class FXUnicodeString < String
      def initialize( str, encoding = $KCODE )
        # with encoding being ASCII (latin1, really), Unicode, SJIS or EUC
      end
      # ...etc...
      # ...with all of ruby's standard String methods implemented, using fox's unicode as the backend.
    end

    # And perhaps... for ease of use...
    module Kernel   # Kernel is a module, so it is reopened with "module"
      def u(str)
        FXUnicodeString.new(str, 'U')
      end
    end

or...

b) Simply by having a similar text function for the widgets with Unicode, so that, besides:

    text()
    text=()      # both returning the string unprocessed.

there's also

    text(str, encoding)
    text_enc()   # returns [text, encoding], if fox remembers the encoding

Obviously, a) is better in that it would be more similar to what Ruby plans to eventually do with Unicode support (and thus, eventually, FXUnicodeString could just be replaced with Ruby's String itself), albeit it may end up being more work in having to implement all of Ruby's methods.
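For experimenting before any FXUnicodeString exists, a u() helper can be approximated in plain Ruby. This is a rough sketch under assumptions: it is not Why's implementation and has nothing to do with FOX, it returns an ordinary String of UTF-8 bytes, and it assumes the U+XXXX notation always uses exactly four hex digits.

```ruby
# Hypothetical stand-in for the proposed u() helper: expands every
# U+XXXX sequence (exactly four hex digits) into its UTF-8 bytes.
def u(str)
  str.gsub(/U\+([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
end

question = u('U+00bfHabla espaU+00f1ol?')
puts question.unpack('U*').length   # prints 15
```

Because the result is a plain String, String#length on Ruby 1.8 still reports bytes; only the unpack('U*') count is character-aware.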
> (or, alternatively, //m which is for multi-byte -- meaning ANSI, UTF-8,
> EUC, or SJIS depending on
> what $KCODE is set to, albeit I believe this is now no longer needed as
> setting $KCODE will already
> adjust all regexes).

Err... actually this is not correct. /m is for multi-line in regex. Not sure what the heck I was thinking there.
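A small check of what /m actually does, for reference: it lets "." match a newline and has nothing to do with multi-byte encodings. The helper name is made up for the example.

```ruby
# Hypothetical helper: index of the first match, or nil if there is none.
def match_index(text, pattern)
  text =~ pattern
end

text = "line one\nline two"
p match_index(text, /one.line/)    # prints nil ("." does not match "\n")
p match_index(text, /one.line/m)   # prints 5   (with /m, "." matches "\n")
```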
Gonzalo,

Thanks a lot for your thorough notes! I think you covered everything I was curious about.

Oliver

> -----Original Message-----
> From: fxruby-users-bounces@rubyforge.org
> [mailto:fxruby-users-bounces@rubyforge.org]On Behalf Of Gonzalo
> Garramuno
> Sent: Saturday, September 03, 2005 12:24 AM
> To: fxruby-users@rubyforge.org
> Subject: Re: [fxruby-users] Unicode support in FXRuby 1.6