thr3ads.net - Rails - Regexp and non-ascii characters [Oct 2005]

Jaroslaw Zabiello

2005-Oct-10 20:16 UTC

Regexp and non-ascii characters

I do not know how to make Ruby recognize \w for non-ASCII
characters. Let's assume I have some strings in cp1250 encoding. In
Python I can convert them to a unicode object and use a regexp on them.
Let's compare:

Python:
>>> txt = 'abc łąka def' # cp1250 text
>>> txt
'abc \xb3\xb9ka def'
>>> txt = unicode(txt, 'cp1250') # change to unicode object
>>> txt
u'abc \u0142\u0105ka def'
>>> import re
>>> re.compile('(\w+)', re.U).findall(txt)
[u'abc', u'\u0142\u0105ka', u'def']

BINGO! It could find the middle word.

Ruby:

require 'iconv'
=> true
txt = 'abc łąka def'
=> "abc \263\271ka def"
txt =  Iconv.conv('utf-8', 'cp1250', txt)
=> "abc \305\202\304\205ka def"
pattern = Regexp.new(/(\w+)/, true, 'U')
 txt.scan(pattern)
=> [["abc"], ["ka"], ["def"]]

Failed! :( Ruby cannot find non-ASCII characters. It cannot see them
at all (sic!)

Is there any solution for Ruby?

_______________________________________________
Rails mailing list
Rails-1W37MKcQCpIf0INCOvqR/iCwEArCW2h5@public.gmane.org
http://lists.rubyonrails.org/mailman/listinfo/rails

Alex Young

2005-Oct-10 20:44 UTC

Re: Regexp and non-ascii characters

Do this first:

$KCODE='u'
require 'jcode'

then

> require 'iconv'
> => true
> txt = 'abc łąka def'
> => "abc \263\271ka def"
> txt =  Iconv.conv('utf-8', 'cp1250', txt)
> => "abc \305\202\304\205ka def"
> pattern = Regexp.new(/(\w+)/, true, 'U')
>  txt.scan(pattern)

=> [["abc"], ["łąka"], ["def"]]

Was that what you were after?

-- 
Alex
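
For readers on a later Ruby, the same word scan can be sketched without $KCODE, jcode or Iconv. This is only an illustration under a stated assumption: it needs Ruby 2.0 or newer, where strings carry their own encoding, and that is not the Ruby used in this thread. On those versions \w still matches ASCII word characters only, so the sketch uses the Unicode letter property \p{L} instead.

```ruby
# Assumption: Ruby 2.0+ (not the Ruby 1.8 used in this thread).
# The same cp1250 bytes as in the original post: \xB3 = ł, \xB9 = ą.
raw = "abc \xB3\xB9ka def".b

# Label the bytes as cp1250, then transcode them to UTF-8.
utf8 = raw.force_encoding("Windows-1250").encode("UTF-8")

# \p{L} matches any Unicode letter, so the middle word is found too.
words = utf8.scan(/\p{L}+/)
p words
```

Because the pattern has no capture group, scan returns a flat array of words rather than the nested arrays shown above.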

Julian 'Julik' Tarkhanov

2005-Oct-10 21:57 UTC

Re: Regexp and non-ascii characters

On 10-okt-2005, at 22:32, Jaroslaw Zabiello wrote:
>
> Failed! :( Ruby cannot find non ascii characters. It cannot see tham
> at all (sic!)
>
> Is there any solution for Ruby?

1. Convert your strings once and for all. Get rid of them. Go
Unicode. 8-bit is EVIL. Let it die in flames in your backyard.
Usually it's 3 commands in the terminal.

2. $KCODE = 'u'
     require 'jcode'
     Some advise that jcode is bad, but I can't imagine all the
hoops you have to go through if you don't use it.

3. "stuff" =~ /uff/u  # => 2 (a match)

Don't forget the /u key. It works quite well and I ported an HTML
beautification library from PHP using that.

N.B. don't expect anything to work with old 8-bit charsets. I hope
they will all break one day so that nobody dares to use them anymore.
-- 
Julian "Julik" Tarkhanov

Jaroslaw Zabiello

2005-Oct-10 23:00 UTC

Re: Regexp and non-ascii characters

2005/10/10, Alex Young <alex@blackkettle.org>:
...
> Was that what you were after?

My real problems begin when I want to find words with mixed lower and
upper case. Once again.

Python:

import re
s = 'abcd łąka def'
u = unicode(s, 'cp1250')
re.compile(unicode('łąka', 'cp1250'), re.U|re.I).findall(u)
[u'\u0142\u0105ka']

and

re.compile(unicode('łĄka', 'cp1250'), re.U|re.I).findall(u)
[u'\u0142\u0105ka']

Lower or upper case does not matter. This is what I wanted.

I need that feature for my text searcher. I want to highlight the searched
phrases in the text even if some of their characters differ in upper/lower
case.


Ruby:

$KCODE='u'
require 'jcode'
require 'iconv'
s = 'abc łąka def'
u = Iconv.conv('utf-8', 'cp1250', s)
pattern = Regexp.new(Iconv.conv('utf-8', 'cp1250', 'łąka'), true, 'U')
u.scan(pattern)
=> ["łąka"]
pattern = Regexp.new(Iconv.conv('utf-8', 'cp1250', 'łĄka'), true, 'U')
u.scan(pattern)
=> []

Failed :( It could not find the phrase when I changed one non-ASCII
letter from lower to upper case.

--
JZ

Julian 'Julik' Tarkhanov

2005-Oct-11 01:10 UTC

Re: Regexp and non-ascii characters

On 11-okt-2005, at 1:00, Jaroslaw Zabiello wrote:
> 2005/10/10, Alex Young <alex-qV/boFbD8Meu8LGVeLuP/g@public.gmane.org>:
> ...
>
>> Was that what you were after?
>>
>
> My real problems begin when I want to find words with mixing lower and
> upper case. Once again.

$KCODE = 'u'
require 'jcode'

"SomeWicKEDШтучКА" =~ /учк/ui

Please read Pickaxe.
If you need to compile a regex with strings in it you can use the  
literal form of

/foo#{method_call_or_whatever}/ui

-- 
Julian "Julik" Tarkhanov
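
When a runtime string is interpolated into a pattern as suggested in this post, it is usually safer to escape it first. A minimal sketch under a stated assumption: it needs Ruby 1.9 or newer with UTF-8 strings, where the /i flag also folds non-ASCII letters, and it says nothing about the $KCODE setup of this thread. phrase_pattern is a hypothetical helper name, not part of any library.

```ruby
# Assumption: Ruby 1.9+ with UTF-8 source and strings.
# phrase_pattern is a hypothetical helper, not part of any library.
def phrase_pattern(phrase)
  # Regexp.escape neutralises metacharacters such as "." or "(" in user input.
  /#{Regexp.escape(phrase)}/i
end

text = "SomeWicKEDШтучКА"
p(text =~ phrase_pattern("учк"))   # index of the match, or nil
p(text =~ phrase_pattern("w.c"))   # nil: the dot is matched literally
```

Without the escape, "w.c" would be read as a pattern and would match "Wic" in the sample text.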

Jaroslaw Zabiello

2005-Oct-11 07:37 UTC

Re: Regexp and non-ascii characters

2005/10/11, Julian 'Julik' Tarkhanov <listbox@julik.nl>:
> $KCODE = 'u'
> require 'jcode'
>
> "SomeWicKEDШтучКА" =~ /учк/ui
>
> Please read Pickaxe.
> If you need to compile a regex with strings in it you can use the
> literal form of
>
> /foo#{method_call_or_whatever}/ui

I am afraid that is not enough. I checked your example:

$KCODE = 'u'
require 'jcode'

if "SomeWicKEDШтучКА" =~ /учк/ui
    puts 'FOUND'
else
    puts 'NOT FOUND'
end

The result is "NOT FOUND" :(

--
JZ

Julian 'Julik' Tarkhanov

2005-Oct-11 12:57 UTC

Re: Regexp and non-ascii characters

On 11-okt-2005, at 9:37, Jaroslaw Zabiello wrote:
> $KCODE = 'u'
> require 'jcode'
>
> if "SomeWicKEDШтучКА" =~ /учк/ui
>     puts 'FOUND'
> else
>     puts 'NOT FOUND'
> end

Oh crap, I am drowning in my own arrogance. Indeed, the /i key does not
work for Unicode (not in Ruby and not in - surprise - PHP).

I wonder how many ways exist for a scripting language to break  
Unicode - for now only Perl and Python at least seem to do it  
properly. I hope (sigh) that this is top-priority for YARV instead of  
(sigh) reimplementing block scopes and whatnot.

What you really need is this:

1. Get the Unicode gem. http://www.yoshidam.net/Ruby.html#unicode

2. require 'unicode'
   Unicode::downcase(string).sub(/your_pattern/u, replacement)
   etc. - insert replacements into the original text at the positions where
   your pattern is found in the downcased copy.

Unicode::capitalize(), Unicode::upcase() and the rest are not to be missed.

If you intend to distribute your application you can make a stub for
those who don't have Unicode available:

begin
   require 'unicode'
rescue LoadError
   RAILS_DEFAULT_LOGGER.error "You don't have the Unicode gem installed, " +
     "capitalizations will be single-byte and case conversion might be broken"
   # Fallback stub: forward Unicode.downcase(str) etc. to the plain String method
   module Unicode
     def self.method_missing(meth, *args)
       args[0].send(meth)
     end
   end
end

Unicode support in Ruby is almost non-existent.

-- 
Julian "Julik" Tarkhanov
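
The "downcase, search, then mark the hit in the original text" idea from this post can be sketched as follows. This is an illustration only, and it swaps in a different tool: it assumes Ruby 2.4 or newer, where the built-in String#downcase handles non-ASCII letters, rather than the Unicode gem discussed here. highlight and its marker arguments are hypothetical names.

```ruby
# Assumption: Ruby 2.4+, where String#downcase handles non-ASCII letters.
# highlight and its marker arguments are hypothetical names for illustration.
def highlight(text, phrase, open_mark = "[", close_mark = "]")
  folded_text   = text.downcase
  folded_phrase = phrase.downcase
  # Give up if there is nothing to search for, or if case folding changed
  # the length of the text (offsets would no longer line up).
  return text if folded_phrase.empty? || folded_text.length != text.length

  out = "".dup
  pos = 0
  # Search the lower-cased copy, but copy slices from the original text.
  while (hit = folded_text.index(folded_phrase, pos))
    out << text[pos...hit] << open_mark
    out << text[hit, folded_phrase.length] << close_mark
    pos = hit + folded_phrase.length
  end
  out << text[pos..-1]
end

puts highlight("abc ŁĄKA def łąka", "łĄka")
```

The original spelling of each hit is preserved, which is what a search-result display usually wants; only the comparison is done on the lower-cased copies.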

Jaroslaw Zabiello

2005-Oct-11 15:56 UTC

Re: Regexp and non-ascii characters

2005/10/11, Julian 'Julik' Tarkhanov <listbox-RY+snkucC20@public.gmane.org>:
> What you really need is this..
> 1. Get the Unicode gem. http://www.yoshidam.net/Ruby.html#unicode

But I am using a WinXP box and I would need a binary library... :(

> Unicode support in Ruby is almost non-existent

That's very bad news, because I have started building an application in
which Unicode plays a very important role.

--
JZ

Julian 'Julik' Tarkhanov

2005-Oct-11 18:58 UTC

Re: Regexp and non-ascii characters

On 11-okt-2005, at 17:56, Jaroslaw Zabiello wrote:
> 2005/10/11, Julian 'Julik' Tarkhanov <listbox-RY+snkucC20@public.gmane.org>:
>
>> What you really need is this..
>> 1. Get the Unicode gem. http://www.yoshidam.net/Ruby.html#unicode
>
> But I am using WinXP box and I would need binary library... :(
>
>> Unicode support in Ruby is almost non-existent
>
> It's a very bad news because I started to make application where
> unicode plays very important role.

Well, I guess it is. But I am using the Unicode gem and I am happily
deploying applications that use Russian strings. Everywhere. You also have
the Locale module, which gives you localized date and currency formatting
etc. Together with the Unicode gem and Iconv, there has been nothing I
couldn't tackle yet.

If you still want to go the Rails route I think you can find folks on
ruby-lang who will help you compile the needed libs. Or go UNIX.

-- 
Julian "Julik" Tarkhanov
