Hi all,

I was wondering if I could get some feedback on a patch I created for ActiveSupport's `tidy_bytes` method.

Right now `tidy_bytes` doesn't work with 1.9.x, since it relies on a Unicode regexp that always fails for strings with invalid UTF-8 characters. You can see the essence of the problem easily by firing up any 1.9.x irb and doing this:

    ruby-1.9.2-preview1 > "\x93".split(//u)
    ArgumentError: invalid byte sequence in UTF-8
        from (irb):2:in `split'
        from (irb):2
        from /Users/norman/.rvm/rubies/ruby-1.9.2-preview1/bin/irb:17:in `<main>'

This patch resolves the issue by traversing the string as bytes rather than codepoints, and is about twice as fast as the current implementation. Rather than using the current implementation's regular expression, it checks each byte's first 0 bit to determine its validity. This Wikipedia article was a useful reference while working on the patch:

http://en.wikipedia.org/wiki/UTF-8#Description

It also adds a `force` option to allow cleanup of byte sequences that are both valid CP-1252 / ISO-8859-1 and UTF-8. This can be used when the developer knows that their input is encoded in CP-1252 or ISO-8859-1 and wants to recode it to UTF-8. (Again, the presence of invalid characters will prevent doing this by simply using #encode or #force_encoding on 1.9.)

* The patch: http://gist.github.com/361115
* LH Ticket: https://rails.lighthouseapp.com/projects/8994/tickets/4350-tidy_bytes-fails-on-19x

Here is also a library where you can see this code in isolation:

http://github.com/norman/utf8_utils

Regards,

Norman

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To post to this group, send email to rubyonrails-core@googlegroups.com.
To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en.
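For readers following along, the byte-wise approach described in the message above can be sketched roughly as follows. This is not the code from the patch or from utf8_utils. The method names (`utf8_sequence_length`, `cp1252_byte_to_utf8`, `tidy_bytes_sketch`) are hypothetical, the `force` option is not covered, and the sketch assumes that any byte which does not belong to a valid UTF-8 sequence should be read as CP-1252. It also leans on `String#valid_encoding?` to confirm each candidate sequence, which may differ from how the patch validates bytes.

```ruby
# Hypothetical helper: expected length of a UTF-8 sequence, judged from where
# the first 0 bit falls in the lead byte. Returns nil for bytes that cannot
# start a sequence.
def utf8_sequence_length(byte)
  if    byte < 0x80 then 1    # 0xxxxxxx: ASCII
  elsif byte < 0xC0 then nil  # 10xxxxxx: continuation byte, cannot lead
  elsif byte < 0xE0 then 2    # 110xxxxx
  elsif byte < 0xF0 then 3    # 1110xxxx
  elsif byte < 0xF8 then 4    # 11110xxx
  end                         # 11111xxx falls through to nil
end

# Hypothetical helper: reinterpret a single byte as CP-1252 and recode it.
# Bytes that CP-1252 leaves undefined become U+FFFD instead of raising.
def cp1252_byte_to_utf8(byte)
  [byte].pack("C")
        .force_encoding("Windows-1252")
        .encode("UTF-8", invalid: :replace, undef: :replace)
end

# Assumed behavior: keep valid UTF-8 sequences, treat everything else as CP-1252.
def tidy_bytes_sketch(string)
  bytes  = string.bytes
  result = String.new(encoding: Encoding::UTF_8)
  i = 0
  while i < bytes.length
    length    = utf8_sequence_length(bytes[i])
    candidate = length && bytes[i, length].pack("C*").force_encoding(Encoding::UTF_8)
    if candidate && candidate.valid_encoding?
      result << candidate                      # well-formed sequence, copy as-is
      i += length
    else
      result << cp1252_byte_to_utf8(bytes[i])  # stray byte, recode it alone
      i += 1
    end
  end
  result
end
```

Because the string is only ever read through `String#bytes`, an invalid byte such as "\x93" should not trigger the ArgumentError shown in the irb session. Under the assumptions above, `tidy_bytes_sketch("\x93")` yields the left double quotation mark (U+201C), which is what 0x93 means in CP-1252.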
Jeremy Kemper
2010-Apr-09 16:47 UTC
Re: [patch] fix tidy_bytes for 1.9.x, improve performance
On Fri, Apr 9, 2010 at 6:05 AM, Norman Clarke <norman@njclarke.com> wrote:
> I was wondering if I could get some feedback on a patch I created for
> ActiveSupport's `tidy_bytes` method.
> [...]

This is great. Thanks, Norman!

jeremy