thr3ads.net - Rails - [Rails] How do I get substring of utf-8 string? [Mar 2006]

Dae San Hwang

2006-Mar-21 02:51 UTC

[Rails] How do I get substring of utf-8 string?

I'm trying to get a substring from a UTF-8 encoded string (say, the first
50 characters of the string).  String#[0..49] would give me the first
50 bytes, not the first 50 characters.
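
To make the byte/character distinction concrete, here is a minimal sketch that uses only String#unpack and Array#pack, which should behave the same way on Ruby 1.8 and on later versions. The sample word and the variable names are illustrative and not from the original post.

```ruby
# Build the Cyrillic word "Привет" from code points so the example
# does not depend on the encoding of the source file.
word = [0x41F, 0x440, 0x438, 0x432, 0x435, 0x442].pack("U*")

byte_count = word.unpack("C*").size   # one entry per byte
char_count = word.unpack("U*").size   # one entry per Unicode code point

# Each of these Cyrillic letters takes two bytes in UTF-8.
puts "bytes: #{byte_count}, code points: #{char_count}"
# prints "bytes: 12, code points: 6"
```

A byte-oriented slice of such a string can therefore cut a character in half, which is the problem described above.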

I know there is the jcode library, but it only lets you count the number of
characters in a UTF-8 string.

The unicode gem doesn't seem to help much.  The unicode_hacks gem seems to
solve the problem, but it also seems to change the methods of the String
class directly, so it may confuse Rails, which expects String#[]
to give back bytes, not characters.

Can somebody point out which route I should take?
Should I implement the substring methods myself?  Hasn't someone
already solved this problem?

thanks,

daesan

Alex Zhukov

2006-Mar-22 05:22 UTC


If you only need a substring you can fake it with the jcode library.
Use something like this:

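$KCODE = 'u'
require 'jcode'

class String
   # Returns characters a..b (1-based, inclusive), counting characters, not bytes.
   def usubstr a, b
     i = 0
     buff = ''
     each_char do
       | c |
       i += 1
       buff += c if i >= a
       return buff if i == b
     end
     buff  # the string ended before character b was reached
   end
end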

bla = "put here some unicode string"
puts bla.usubstr 6, 10

It works with Cyrillic UTF-8 text and should work for other languages too.
I hope this helps.
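
For comparison, the same idea can be sketched without jcode by converting the string to an array of code points and slicing that array. This is only an illustration: utf8_substr is a hypothetical helper name, it takes a 0-based start and a length rather than the 1-based a and b used above, and it counts code points, so a letter followed by a combining accent counts as two.

```ruby
# Hypothetical helper, not part of jcode or any gem mentioned in this thread.
# Returns `length` code points of `str` starting at the 0-based index `start`,
# or nil if `start` lies beyond the end of the string.
def utf8_substr(str, start, length)
  codepoints = str.unpack("U*")         # raises ArgumentError on malformed UTF-8
  piece = codepoints[start, length]     # Array#[] slices by element, not by byte
  piece ? piece.pack("U*") : nil
end

# "Привет" built from code points to stay independent of source encoding.
text = [0x41F, 0x440, 0x438, 0x432, 0x435, 0x442].pack("U*")
puts utf8_substr(text, 1, 3)            # the second through fourth letters
```

Unpacking the whole string is wasteful for very long input; the each_char loop above can stop early, which this sketch cannot.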

--
best regards,
Alex Zhukov
baron.pampa@gmail.com




Dae San Hwang

2006-Mar-22 06:29 UTC


Awesome!  I must have overlooked the each_char method in the jcode library.

In the meantime, I modified unicode_hacks by Julik so that it does
not overload the existing String methods.  I attached the modified source
code at the end of this email.  The UTF-8 compatible equivalent methods
are prefixed with 'u_'.  For example, the length of a UTF-8 string is
returned by the 'u_length' method, a substring of a UTF-8 string is
returned by 'u_slice', etc.  This hack requires the unicode gem.

Thank you Alex for the tip.  I would use your tip for simpler needs. :-)

daesan

PS: The original unicode_hacks is available at
http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/



# This is a modified version of unicode_hacks so that regular String
# methods are not overloaded.  Instead, UTF-8 compatible equivalent
# methods are prefixed with "u_".

begin
   require 'unicode'
   # Do some SUBSTANTIAL rewiring of the String class. This doesn't solve all of the problems
   # but it does solve some. And it will work in UTF-8 context only, so we step aside
   # if $KCODE is not UTF-8 (Japanese people prefer JIS, right?)
   #
   # Following the tradition - I am grateful to Yoshida MASATO for the Unicode gem.
   #
   # The core capabilities of String are changed by this module only when $KCODE is set to 'UTF8'.
   # Strings start to properly trim, properly strip and size, and do many other nice things they have
   # been supposed to do for ages.
   # All "old" byte-oriented methods of Strings are still available with "byte_" prefix
   # (i.e. "byte_reverse", "byte_slice")
   class String
   end

   unless defined?(String::UNICODE_REWIRED) # rewire only once even if it's reloaded
     String.class_eval do

       UNICODE_REWIRED = true

       class <<self
         # Returns a regular expression pattern that matches the passed Unicode codepoints
         def codepoints_to_pattern(array_of_codepoints)
           array_of_codepoints.collect{ |e| [e].pack "U*" }.join('|')
         end
       end

       UNICODE_WHITESPACE = [
         (0x0009..0x000D).to_a,  # White_Space # Cc   [5] <control-0009>..<control-000D>
         0x0020,          # White_Space # Zs       SPACE
         0x0085,          # White_Space # Cc       <control-0085>
         0x00A0,          # White_Space # Zs       NO-BREAK SPACE
         0x1680,          # White_Space # Zs       OGHAM SPACE MARK
         0x180E,          # White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
         (0x2000..0x200A).to_a, # White_Space # Zs  [11] EN QUAD..HAIR SPACE
         0x2028,          # White_Space # Zl       LINE SEPARATOR
         0x2029,          # White_Space # Zp       PARAGRAPH SEPARATOR
         0x202F,          # White_Space # Zs       NARROW NO-BREAK SPACE
         0x205F,          # White_Space # Zs       MEDIUM MATHEMATICAL SPACE
         0x3000,          # White_Space # Zs       IDEOGRAPHIC SPACE
       ].flatten

       UNICODE_LEADERS_AND_TRAILERS = UNICODE_WHITESPACE + [65279] # ZERO-WIDTH NO-BREAK SPACE aka BOM

       # Borrowed from the Kconv library by Shinji KONO - (also as seen on the W3C site)
       UTF8_PAT = /\A(?:
                     [\x00-\x7f]                                     |
                     [\xc2-\xdf] [\x80-\xbf]                         |
                     \xe0        [\xa0-\xbf] [\x80-\xbf]             |
                     [\xe1-\xef] [\x80-\xbf] [\x80-\xbf]             |
                     \xf0        [\x90-\xbf] [\x80-\xbf] [\x80-\xbf] |
                     [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf] |
                     \xf4        [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
                    )*\z/xn


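       # Anchored with \A and \z rather than ^ and $, so that only the ends of the
       # whole string are trimmed and not the ends of every line inside it.
       UNICODE_TRAILERS_PAT = /(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+\z/
       UNICODE_LEADERS_PAT = /\A(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/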

       # Performs Unicode-aware conversion to lowercase
       def u_downcase
         return downcase unless utf8_pragma?

         Unicode::downcase(Unicode::normalize_KC(self))
       end

       def u_downcase! #:nodoc:
          self.replace u_downcase
       end

       # Performs Unicode-aware conversion to UPPERCASE
       def u_upcase
         return upcase unless utf8_pragma?

         Unicode::upcase(Unicode::normalize_KC(self))
       end

       def u_upcase! #:nodoc:
          self.replace u_upcase
       end

       # Performs Unicode-aware Capitalization
       def u_capitalize
         return capitalize unless utf8_pragma?

         Unicode::capitalize(Unicode::normalize_KC(self))
       end

       def u_capitalize! #:nodoc:
          self.replace u_capitalize
       end

       # Instead of fetching bytes will fetch the string composed of codepoints at the specified offsets.
       # The call with a single integer as argument will still return a byte.
       # If the string is not a valid UTF-8 sequence bytes will be returned
       def u_slice(*args)
         return slice(*args) unless utf8_pragma?

         if (args.size == 2 && args.first.is_a?(Range))
           raise TypeError, 'cannot convert Range into Integer' # Do as if we were native
         elsif (args.first.is_a?(Range) or args.size == 2)
           #normalize to KC so that all combined glyphs are spliced together and ligatures split, and then....
           Unicode::normalize_KC(self).unpack("U*").send(:slice, *args).pack("U*")
         else
           slice(*args)
         end
       end

       def u_index(*args)
         if (args.first.is_a?(String) and !args.first.has_utf8_semantics?) or !utf8_pragma?
           return index(*args)
         end

         bidx = index(*args)
         return nil unless bidx
         return self.slice(0...bidx).unpack("U*").size
       end

       # Replacement for the strip routine. Removes all leading and trailing Unicode whitespace,
       # including line breaks and nonbreaking spaces
       def u_strip
         return strip unless utf8_pragma?

         u_lstrip.u_rstrip
       end

       # Replacement for the lstrip routine. Removes all leading Unicode whitespace,
       # including line breaks and nonbreaking spaces
       def u_lstrip
         return lstrip unless utf8_pragma?

         gsub(UNICODE_LEADERS_PAT, '')
       end

       # Replacement for the rstrip routine. Removes all trailing Unicode whitespace,
       # including line breaks and nonbreaking spaces
       def u_rstrip
         return rstrip unless utf8_pragma?

         gsub(UNICODE_TRAILERS_PAT, '')
       end

       def u_lstrip! #:nodoc:
         self.replace u_lstrip
       end

       def u_rstrip! #:nodoc:
         self.replace u_rstrip
       end

       def u_strip! #:nodoc:
         self.replace u_strip
       end

       # Decomposes the string and returns the decomposed string
       def decompose
         Unicode::decompose(self)
       end

       # Normalizes the string to form KC and returns the result
       def normalize_KC
         Unicode::normalize_KC(self)
       end

       # Normalizes the string to form D and returns the result
       def normalize_D
         Unicode::normalize_D(self)
       end

       # Normalizes the string to form C and returns the result
       def normalize_C
         Unicode::normalize_C(self)
       end

       # Provides replacement for the size routine. Will first normalize to KC and then return the number
       # of codepoints
       def u_size
         return size unless utf8_pragma?

         #normalize to KC so that all combiner letters are spliced together, and then....
         Unicode::normalize_KC(self).unpack("U*").size
       end

       def u_length #:nodoc:
         u_size
       end


       # Provides replacement for the reverse routine. Will first normalize to KC and then reverse the resulting
       # codepoints
       def u_reverse
         return reverse unless utf8_pragma?

         Unicode::normalize_KC(self).unpack("U*").reverse.pack("U*")
       end

       # Inserts the string at codepoint offset specified in offset.
       def u_insert(offset, fragment)
         return insert(offset, fragment) unless utf8_pragma?

         self.replace(unpack("U*").insert(offset, fragment.unpack("U*")).flatten.pack("U*"))
       end

       # Returns false or true depending on whether the string has UTF-8 semantics
       # (a String used for purely byte resources is unlikely to have them).
       def has_utf8_semantics?
         UTF8_PAT.match(self)
       end

       private
         def utf8_pragma?
           ($KCODE == 'UTF8') and (self.has_utf8_semantics?)
         end
     end

     if defined?(RAILS_DEFAULT_LOGGER)
       RAILS_DEFAULT_LOGGER.warn "Standard string functions have been overloaded with " +
                                 "UTF8-aware versions"
     end
   end
rescue LoadError
   if defined?(RAILS_DEFAULT_LOGGER)
     RAILS_DEFAULT_LOGGER.error "You don't have the Unicode library installed, most string " +
                                "operations will stay single-byte"
   end
end
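
The guard used throughout the code above (fall back to the byte-oriented method unless the string is valid UTF-8) can be sketched without the unicode gem. This is only an illustration under stated assumptions: valid_utf8? and safe_char_count are hypothetical names, and the ArgumentError that String#unpack("U*") raises on malformed input stands in for the UTF8_PAT regular expression. Unlike u_size, the sketch does not normalize to KC, so its count can differ for strings with combining characters.

```ruby
# Hypothetical helper: true if the whole string decodes as UTF-8.
def valid_utf8?(str)
  str.unpack("U*")   # raises ArgumentError on malformed UTF-8
  true
rescue ArgumentError
  false
end

# Hypothetical helper: code point count for valid UTF-8, byte count otherwise.
def safe_char_count(str)
  valid_utf8?(str) ? str.unpack("U*").size : str.unpack("C*").size
end

good = [0x41F, 0x440].pack("U*")   # two Cyrillic letters, four bytes in UTF-8
bad  = [0xC3].pack("C*")           # a lead byte whose continuation byte is missing

puts safe_char_count(good)         # counted as code points
puts safe_char_count(bad)          # counted as bytes
```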
