Thibaut Barrère
2009-Mar-01 10:57 UTC
[Ironruby-core] Issue with accents (UTF-8) - is it supposed to work ?
Hi,

not sure if it's an oddity in my code, a bug or a non-implemented feature in IronRuby or Mono - so I'm reporting it here. When using accents inside strings ("Barrère") that I pass to either buttons or datagridviews, they translate into "BarrÃ¨re". Here's a sample (also available on github <http://github.com/thbar/ironruby-labs/blob/ca47f06024e936690d427d297909c9a78b0481e6/ui/006_datagridview.rb>):

    form = Magic.build do
      form(:text => "DataGridView sample", :width => 800, :height => 600) do
        # nifty - current Magic.build makes it possible to reuse the control that has been added
        @grid = data_grid_view :dock => DockStyle.fill
        @grid.column_count = 2
        @grid.columns[0].name = "First name"
        @grid.columns[1].name = "Last name"

        @grid.rows.add("Thibaut","Barrère") # using my name with its nasty accent - utf-8 ?
      end
    end

After editing the datagridview, I noticed a log on stdout from mono:

    009-03-01 11:48:36.927 mono[5512:10b] WARNING: CFSTR("Barr\37777777703\37777777603\37777777702\37777777650re") has non-7 bit chars, interpreting using MacOS Roman encoding for now, but this will change. Please eliminate usages of non-7 bit chars (including escaped characters above \177 octal) in CFSTR().

So I guess the issue probably boils down to non-MacOS Roman support in Mono. What do you think?

-- Thibaut
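A minimal Ruby sketch of how this kind of garbling can arise, assuming the UTF-8 bytes of the name are misread as a single-byte encoding (ISO-8859-1 is used here as an assumption) and then re-encoded. Run on a current Ruby, it produces the bytes 0xC3 0x83 0xC2 0xA8 after "Barr", which are the low 8 bits of the four sign-extended octal escapes in the CFSTR warning above:

```ruby
# The name as UTF-8; \u00E8 is the accented "e" (bytes 0xC3 0xA8).
name = "Barr\u00E8re"

# Misread the same bytes as a single-byte encoding (ISO-8859-1 is an
# assumption for this sketch), then convert the result to UTF-8 again.
garbled = name.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Show the garbled text and the four bytes that replaced the accented letter,
# written as octal escapes for comparison with the warning.
puts garbled
puts garbled.bytes[4, 4].map { |b| format("\\%o", b) }.join
```

The second line of output lists the octal escapes \303\203\302\250, so the string had probably been encoded twice by the time it reached the native layer.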
Thibaut Barrère
2009-Mar-03 14:35 UTC
[Ironruby-core] Issue with accents (UTF-8) - is it supposed to work ?
Hi,

> not sure if it's an oddity in my code, a bug or non-implemented feature in
> IronRuby or Mono - so I'm reporting it here. When using accents inside
> strings ("Barrère") that I pass to either buttons or datagridviews, they
> translate into "BarrÃ¨re". Here's a sample (also available on github):

Bumping this one - do you have any idea what's happening there? Is it a Mono-related issue?

-- Thibaut
Ivan Porto Carrero
2009-Mar-03 14:57 UTC
[Ironruby-core] Issue with accents (UTF-8) - is it supposed to work ?
No, it's not a Mono-related issue. I get the same results when I run your sample on Windows with MS.NET.

It must be an encoding thing. When I set $KCODE to "UTF-8" it still has the same behavior, which is weird I guess :)

On Tue, Mar 3, 2009 at 3:35 PM, Thibaut Barrère <thibaut.barrere at gmail.com> wrote:

> Bumping this one - do you have some idea of what's happening there ?
> Is it a mono related issue ?
Tomas Matousek
2009-Mar-03 17:55 UTC
[Ironruby-core] Issue with accents (UTF-8) - is it supposed to work ?
I'll take a look.

Tomas
Tomas Matousek
2009-Mar-03 18:36 UTC
[Ironruby-core] Issue with accents (UTF-8) - is it supposed to work ?
If I run this in Ruby 1.8.6:

    > ruby -Ku uni.rb

and uni.rb is UTF-8 encoded without a BOM:

    puts $KCODE
    puts 'héllo'.size

I'll get this output:

    UTF-8
    6

So that clearly doesn't work as one might expect. String literals in MRI 1.8 are always binary (i.e. the accented character is stored as any other 2 bytes in the string). AFAIK $KCODE only affects some built-in and library methods, for example String#inspect, regular expressions, conversion libraries, etc.

Although IronRuby stores string literals in UTF-16 .NET strings, to be fully compatible with MRI 1.8 we use a custom BinaryEncoding for these strings. When a string is converted to an array of bytes using this encoding, only 8 bits of each character are used (the other bits are required to be 0). This works fine for encodings that use a single byte per character. It's broken for multi-byte encodings, but that's a problem with Ruby 1.8 in general.

If you want to use Unicode you should not use 1.8 semantics. You should use the -19 switch to run your script in 1.9 mode and add either a UTF-8 BOM preamble or a Ruby encoding magic comment:

    #encoding: UTF-8
    puts 'héllo'.size

    > ruby19 uni.rb
    5
    > ir.exe -19 uni.rb
    5

In a hosted app you can set 1.9 compat mode when creating the ScriptEngine/Runtime:

    var ruby = IronRuby.Ruby.CreateEngine((setup) => {
        setup.Options["Compatibility"] = RubyCompatibility.Ruby19;
    });

Tomas
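The byte-versus-character difference described above can be reproduced on a current Ruby, which follows the 1.9 semantics. This is only an illustrative sketch; the accented "e" is used as a stand-in for any character that takes two bytes in UTF-8:

```ruby
s = "h\u00E9llo"          # the accented letter takes two bytes in UTF-8

puts s.encoding           # UTF-8 (literals with \u escapes are UTF-8)
puts s.size               # 5 characters
puts s.bytesize           # 6 bytes, the figure Ruby 1.8 reports as the size

# A binary (ASCII-8BIT) copy behaves like a 1.8 string: one "character" per byte.
binary = s.b
puts binary.size          # 6
```

The binary copy is the closest modern equivalent of a 1.8 string literal, which is why String#size returns 6 there.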
Tomas Matousek
2009-Mar-03 19:37 UTC
[Ironruby-core] Issue with accents (UTF-8) - is it supposed to work ?
Actually the 1.8 parser is somewhat influenced by the current $KCODE. Multi-byte characters can be part of identifiers, and deciding where a string literal ends also requires dealing with multi-byte characters.

However, the resulting literals are just plain byte arrays with no knowledge of encoding, so the String#size method is still broken.

To achieve better .NET interop in IronRuby, we will honor KCODE when creating MutableStrings. The representation of the string will be byte[] if it contains any non-ASCII characters and KCODE is set to a non-ASCII encoding. We will also attach the KCODE encoding to the MutableString at creation time. This doesn't affect Ruby 1.8 functionality; it only affects conversions to CLR strings. So if you use KCODE = "U" the CLR strings should be correctly encoded (they are not now, as you are experiencing). I'll implement this feature as soon as possible.

Tomas
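The idea of attaching an encoding to a byte-backed string and using it only at conversion time can be sketched in plain Ruby. TaggedBytes and to_unicode are hypothetical names invented for this illustration; this is not IronRuby's MutableString, and the ISO-8859-1 fallback merely stands in for a one-character-per-byte conversion:

```ruby
# Hypothetical stand-in for a byte-backed string that remembers the encoding
# in effect when it was created (not IronRuby's actual MutableString).
TaggedBytes = Struct.new(:bytes, :encoding) do
  # Decode the raw bytes with the attached encoding. Without one, fall back
  # to one character per byte, which is what garbles multi-byte text.
  def to_unicode
    raw = bytes.pack("C*")
    source = encoding || Encoding::ISO_8859_1
    raw.force_encoding(source).encode(Encoding::UTF_8)
  end
end

utf8_bytes = "h\u00E9llo".bytes   # six bytes; the accented letter uses two

with_kcode    = TaggedBytes.new(utf8_bytes, Encoding::UTF_8)
without_kcode = TaggedBytes.new(utf8_bytes, nil)

puts with_kcode.to_unicode      # five characters, accent intact
puts without_kcode.to_unicode   # six characters, accent split in two
```

The bytes are identical in both cases; only the encoding consulted at conversion time differs, so byte-oriented 1.8 behaviour is left untouched.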
Thibaut Barrère
2009-Mar-03 20:24 UTC
[Ironruby-core] Issue with accents (UTF-8) - is it supposed to work ?
Hi Tomas,

thanks for your two messages and the in-depth explanation. Working with -19 and #encoding: UTF-8 indeed solves the issue (tested on Mono).

> Actually the 1.8 parser is somewhat influenced by the current $KCODE.
> Multi-byte characters could be part of identifiers and also the decision of
> where a string literal ends needs to deal with multi-byte characters.
>
> However, the resulting literals are just plain byte arrays with no knowledge
> of encoding so String#size method is still broken.
>
> To achieve a better .NET interop in IronRuby, we will honor KCODE when
> creating MutableStrings. The representation of the string will be byte[] if
> it contains any non-ascii characters and KCODE is set to a non-ascii
> encoding. We will also attach the KCODE encoding to the MutableString at
> creation time. This doesn't affect Ruby 1.8 functionality, it only affects
> conversions to CLR string. So if you use KCODE = "U" the CLR strings should
> be correctly encoded (they are not now as you are experiencing). I'll
> implement this feature as soon as possible.

I think affecting strings only when conversion to the CLR occurs is a pretty neat idea. I like that a lot more than having to add #encoding and -19 (also because I'm not sure what the impact of using -19 just for that would be).

Because I was curious, I had a look at Rails (2.2.2) output for some of these operations:

    Loading development environment (Rails 2.2.2)
    >> "héllo".size
    => 6
    >> "héllo".chars
    => #<ActiveSupport::Multibyte::Chars:0x2378348 @wrapped_string="héllo">
    >> "héllo".chars.size
    => 5
    >> '€2.99'[0,1]
    => "\342"
    >> '€2.99'.first
    => "€"

So raw access through array indexing is pure bytes, while .first takes multi-byte characters into account. I think the spirit of what you suggest is fairly close to that. I like it, and will test it once you have it implemented.

cheers and thanks for your idea,

-- Thibaut
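For comparison, the same byte-versus-character distinction can be shown without Rails on a current Ruby. This is only a sketch: String#byteslice plays the role of the raw 1.8-style index access here, and it assumes the price string starts with a euro sign (three bytes in UTF-8, the first being octal 342):

```ruby
price = "\u20AC2.99"                 # euro sign followed by "2.99"

first_byte = price.byteslice(0, 1)   # raw byte access, like String#[] in Ruby 1.8
first_char = price[0, 1]             # character access in Ruby 1.9 and later

# Print the first byte as an octal escape, then the first character.
puts format("\\%o", first_byte.bytes.first)
puts first_char

puts price.bytesize                  # 7 bytes
puts price.size                      # 5 characters
```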