Awesome! I must have overlooked the each_char method in the jcode library.
In the meantime, I modified unicode_hacks by Julik so that it does
not overload existing String methods. I attached the modified source
code at the end of this email. The UTF-8 compatible equivalent methods
are prefixed with 'u_'. For example, the length of a UTF-8 string is
returned by the 'u_length' method, a substring of a UTF-8 string is
returned by 'u_slice', etc. This hack requires the unicode gem.
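
To show the idea in isolation, here is a minimal sketch of counting and
cutting by codepoints instead of bytes, which is what u_length and u_slice
do. It is not the attached code: codepoint_length and codepoint_slice are
made-up names, the sketch skips the KC normalization step, and it does not
need the unicode gem.

```ruby
# Hypothetical helpers for illustration only; not part of unicode_hacks.

# Number of codepoints in a UTF-8 string.
def codepoint_length(str)
  str.unpack("U*").size
end

# Substring of +length+ codepoints starting at codepoint offset +start+.
# Returns nil when +start+ is past the end, like String#slice does.
def codepoint_slice(str, start, length)
  piece = str.unpack("U*")[start, length]
  piece && piece.pack("U*")
end

text = "Привет, мир"
puts text.bytesize                # 20 bytes
puts codepoint_length(text)       # 11 codepoints
puts codepoint_slice(text, 0, 6)  # Привет
```

The byte count and the codepoint count differ because each Cyrillic letter
takes two bytes in UTF-8.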
Thank you, Alex, for the tip. I will use it for simpler needs. :-)
daesan
ps: the original unicode_hacks is available at
http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/
On Mar 22, 2006, at 2:22 PM, Alex Zhukov wrote:
> If you only need a substring you can fake it with the jcode library.
> Use something like this:
>
> $KCODE='u'
> require 'jcode'
>
> class String
> def usubstr a, b
> i = 0
> buff = ''
> each_char do
> | c |
> i += 1
> if i >= a: buff += c end
> if i == b: return buff end
> end
> end
> end
>
> bla = "put here some unicode string"
> puts bla.usubstr 6, 10
>
> It works with cyrillic UTF-8 text, should work for other languages
> too.
> I hope this helps.
>
> --
> best regards,
> Alex Zhukov
> baron.pampa@gmail.com
>
>
>
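
For reference, a sketch of the same each_char counting loop as a standalone
method. char_substr is a made-up name; like usubstr above it treats both
positions as 1-based and inclusive, and it assumes a Ruby where
String#each_char is available (through jcode or built in). Unlike the loop
above, it hands back the collected buffer even when the string ends before
the last position is reached.

```ruby
# Hypothetical variant of the usubstr idea; first and last are 1-based
# character positions, both inclusive.
def char_substr(str, first, last)
  buff = ''
  i = 0
  str.each_char do |c|
    i += 1
    buff += c if i >= first   # collect once the first position is reached
    break if i == last        # stop scanning after the last position
  end
  buff
end

puts char_substr("Привет, мир", 1, 6)
```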
> On Mar 21, 2006, at 5:51 AM, Dae San Hwang wrote:
>
>> I'm trying to get substring from a utf-8 encoded string. (say,
>> first 50 characters of the string) String#[0..49] would give me
>> the first 50 bytes not 50 characters..
>>
>> I know there is jcode library, but it only let you count number of
>> characters in utf-8 string.
>>
>> unicode gem doesn't seem to help much. unicode_hacks gem seem to
>> solve the problem, but it also seems to change the methods of
>> String class directly so that it may confuse rails which expects
>> String#[] to give back bytes not characters.
>>
>> Can somebody point out what should be the route I should take?
>> Should I implement substring methods myself? Have not someone
>> already solved this problem?
>>
>> thanks,
>>
>> daesan
>> _______________________________________________
>> Rails mailing list
>> Rails@lists.rubyonrails.org
>> http://lists.rubyonrails.org/mailman/listinfo/rails
>
# This is a modified version of unicode_hacks so that regular String
# methods are not overloaded. Instead, UTF-8 compatible equivalent
# methods are prefixed with "u_".
begin
require 'unicode'
# Adds Unicode-aware methods to the String class. This doesn't solve all
# of the problems but it does solve some. And it will work in UTF-8 context
# only, so we step aside if $KCODE is not UTF-8 (Japanese people prefer JIS,
# right?)
#
# Following the tradition - I am grateful to Yoshida MASATO for the Unicode gem.
#
# The Unicode-aware methods added by this module take effect only when
# $KCODE is set to 'UTF8'; otherwise they fall back to the regular String
# methods. Strings can then properly trim, properly strip and size, and do
# many other nice things they have been supposed to do for ages.
# The regular byte-oriented String methods are left untouched; the
# Unicode-aware equivalents carry the "u_" prefix (e.g. "u_reverse",
# "u_slice").
class String
end
unless defined?(String::UNICODE_REWIRED) # rewire only once even if it's reloaded
String.class_eval do
UNICODE_REWIRED = true
class <<self
# Returns a regular expression pattern that matches the passed Unicode codepoints
def codepoints_to_pattern(array_of_codepoints)
array_of_codepoints.collect{ |e| [e].pack "U*" }.join('|')
end
end
UNICODE_WHITESPACE = [
(0x0009..0x000D).to_a, # White_Space # Cc [5] <control-0009>..<control-000D>
0x0020, # White_Space # Zs SPACE
0x0085, # White_Space # Cc <control-0085>
0x00A0, # White_Space # Zs NO-BREAK SPACE
0x1680, # White_Space # Zs OGHAM SPACE MARK
0x180E, # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
(0x2000..0x200A).to_a, # White_Space # Zs [11] EN QUAD..HAIR SPACE
0x2028, # White_Space # Zl LINE SEPARATOR
0x2029, # White_Space # Zp PARAGRAPH SEPARATOR
0x202F, # White_Space # Zs NARROW NO-BREAK SPACE
0x205F, # White_Space # Zs MEDIUM MATHEMATICAL SPACE
0x3000, # White_Space # Zs IDEOGRAPHIC SPACE
].flatten
UNICODE_LEADERS_AND_TRAILERS = UNICODE_WHITESPACE + [65279] # ZERO-WIDTH NO-BREAK SPACE aka BOM
# Borrowed from the Kconv library by Shinji KONO - (also as seen on the W3C site)
UTF8_PAT = /\A(?:
[\x00-\x7f] |
[\xc2-\xdf] [\x80-\xbf] |
\xe0 [\xa0-\xbf] [\x80-\xbf] |
[\xe1-\xef] [\x80-\xbf] [\x80-\xbf] |
\xf0 [\x90-\xbf] [\x80-\xbf] [\x80-\xbf] |
[\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf] |
\xf4 [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
)*\z/xn
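# Anchored to the end/start of the whole string (\z and \A), so that
# whitespace around line breaks inside the string is left alone.
UNICODE_TRAILERS_PAT = /(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+\z/
UNICODE_LEADERS_PAT = /\A(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/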
# Performs Unicode-aware conversion to lowercase
def u_downcase
return downcase unless utf8_pragma?
Unicode::downcase(Unicode::normalize_KC(self))
end
def u_downcase! #:nodoc:
self.replace u_downcase
end
# Performs Unicode-aware conversion to UPPERCASE
def u_upcase
return upcase unless utf8_pragma?
Unicode::upcase(Unicode::normalize_KC(self))
end
def u_upcase! #:nodoc:
self.replace u_upcase
end
# Performs Unicode-aware Capitalization
def u_capitalize
return capitalize unless utf8_pragma?
Unicode::capitalize(Unicode::normalize_KC(self))
end
def u_capitalize! #:nodoc:
self.replace u_capitalize
end
# Instead of fetching bytes will fetch the string composed of codepoints
# at the specified offsets.
# The call with a single integer as argument will still return a byte.
# If the string is not a valid UTF-8 sequence bytes will be returned
def u_slice(*args)
return slice(*args) unless utf8_pragma?
if (args.size == 2 && args.first.is_a?(Range))
raise TypeError, 'cannot convert Range into Integer' # Do as if we were native
elsif (args.first.is_a?(Range) or args.size == 2)
# normalize to KC so that all combined glyphs are spliced together and
# ligatures split, and then....
codepoints = Unicode::normalize_KC(self).unpack("U*").slice(*args)
# nil for out-of-range offsets, like the native slice
codepoints && codepoints.pack("U*")
else
slice(*args)
end
end
def u_index(*args)
if (args.first.is_a?(String) and !args.first.has_utf8_semantics?) or !utf8_pragma?
return index(*args)
end
bidx = index(*args)
return nil unless bidx
return self.slice(0...bidx).unpack("U*").size
end
# Replacement for the strip routine. Removes all leading and trailing
# Unicode whitespace, including line breaks and nonbreaking spaces
def u_strip
return strip unless utf8_pragma?
u_lstrip.u_rstrip
end
# Replacement for the lstrip routine. Removes all leading Unicode
# whitespace, including line breaks and nonbreaking spaces
def u_lstrip
return lstrip unless utf8_pragma?
gsub(UNICODE_LEADERS_PAT, '')
end
# Replacement for the rstrip routine. Removes all trailing Unicode
# whitespace, including line breaks and nonbreaking spaces
def u_rstrip
return rstrip unless utf8_pragma?
gsub(UNICODE_TRAILERS_PAT, '')
end
def u_lstrip! #:nodoc:
self.replace u_lstrip
end
def u_rstrip! #:nodoc:
self.replace u_rstrip
end
def u_strip! #:nodoc:
self.replace u_strip
end
# Decomposes the string and returns the decomposed string
def decompose
Unicode::decompose(self)
end
# Normalizes the string to form KC and returns the result
def normalize_KC
Unicode::normalize_KC(self)
end
# Normalizes the string to form D and returns the result
def normalize_D
Unicode::normalize_D(self)
end
# Normalizes the string to form C and returns the result
def normalize_C
Unicode::normalize_C(self)
end
# Provides replacement for the size routine. Will first normalize to KC
# and then return the number of codepoints
def u_size
return size unless utf8_pragma?
# normalize to KC so that all combiner letters are spliced together, and then....
Unicode::normalize_KC(self).unpack("U*").size
end
def u_length #:nodoc:
u_size
end
# Provides replacement for the reverse routine. Will first normalize to KC
# and then reverse the resulting codepoints
def u_reverse
return reverse unless utf8_pragma?
Unicode::normalize_KC(self).unpack("U*").reverse.pack("U*")
end
# Inserts the string at codepoint offset specified in offset.
def u_insert(offset, fragment)
return insert(offset, fragment) unless utf8_pragma?
self.replace(unpack("U*").insert(offset, fragment.unpack("U*")).flatten.pack("U*"))
end
# Returns false or true depending on whether the string has UTF-8 semantics
# (a String used for purely byte resources is unlikely to have them).
def has_utf8_semantics?
UTF8_PAT.match(self)
end
private
def utf8_pragma?
($KCODE == 'UTF8') and (self.has_utf8_semantics?)
end
end
if defined?(RAILS_DEFAULT_LOGGER)
RAILS_DEFAULT_LOGGER.warn "UTF8-aware string functions have been " +
"added with the u_ prefix"
end
end
rescue LoadError
if defined?(RAILS_DEFAULT_LOGGER)
RAILS_DEFAULT_LOGGER.error "You don't have the Unicode library installed, most string " +
"operations will stay single-byte"
end
end
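
The strip methods in the source above come down to one idea: build an
alternation from a list of whitespace codepoints and anchor it to the ends
of the string. Here is a standalone sketch of that idea with a shortened
codepoint list. The names WS_CODEPOINTS, LEADING_WS, TRAILING_WS and
strip_unicode_ws are made up for illustration, and the sketch assumes UTF-8
source and UTF-8 strings.

```ruby
# Hypothetical standalone sketch; names are illustrative, not from unicode_hacks.
# SPACE, NO-BREAK SPACE, LINE SEPARATOR, IDEOGRAPHIC SPACE, BOM
WS_CODEPOINTS = [0x0020, 0x00A0, 0x2028, 0x3000, 0xFEFF]

# Turn each codepoint into its UTF-8 character and join them with "|".
WS_ALTERNATION = WS_CODEPOINTS.map { |cp| Regexp.escape([cp].pack("U")) }.join("|")

# \A and \z anchor to the whole string, not to individual lines.
LEADING_WS  = /\A(?:#{WS_ALTERNATION})+/
TRAILING_WS = /(?:#{WS_ALTERNATION})+\z/

# Removes the listed whitespace from both ends and returns a new string.
def strip_unicode_ws(str)
  str.sub(LEADING_WS, "").sub(TRAILING_WS, "")
end

puts strip_unicode_ws("\u00A0\u3000 hello world \u2028")
```

Whitespace in the middle of the string, including at line breaks, is left
as it is because both patterns match only at the very start or very end.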